#Data Manipulation With Python 

Data manipulation here does not mean falsifying data or making it inconsistent with its original values. It means cleaning and reshaping data so that it is easier for machines to analyze. This helps prevent garbage in, garbage out (GIGO), produces higher-quality data, and makes the data more informative for decision making.

First, import the required libraries.
This notebook uses the pandas and NumPy libraries.

In [2]:
import pandas as pd
import numpy as np

The pandas library provides two main objects: Series and DataFrame.

#Series Object

A Series is one-dimensional. Because it holds a single column of values, it has no column names, but every value has an index label.

Example:

In [5]:
age=[30,45,78,1,18,89]

How do we convert the data into a Series?

Example using the previous data:

In [6]:
age=pd.Series(age)

In [7]:
age

0    30
1    45
2    78
3     1
4    18
5    89
dtype: int64

How do we convert the data into an array?

Example using the previous data:

In [8]:
age.values

array([30, 45, 78,  1, 18, 89])
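As a small aside, pandas also offers the to_numpy() method for the same conversion, and its documentation recommends it over .values. The sketch below assumes the same age data; the variable name age_array is only illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([30, 45, 78, 1, 18, 89])

# to_numpy() returns the Series data as a NumPy array
age_array = age.to_numpy()

print(type(age_array))
print(age_array)
```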

Show the index

The index is a range whose start point is inclusive and whose stop point is exclusive.

In [9]:
age.index

RangeIndex(start=0, stop=6, step=1)
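To make the inclusive start and exclusive stop concrete, the following sketch expands the RangeIndex into a plain list. The variable name positions is an assumption used only for illustration.

```python
import pandas as pd

age = pd.Series([30, 45, 78, 1, 18, 89])

# Expanding the RangeIndex shows which labels it actually contains
positions = list(age.index)
print(positions)           # [0, 1, 2, 3, 4, 5]

# The stop value (6) is not part of the index
print(6 in age.index)      # False
```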

The default index is called the implicit index. We can also define the index ourselves; this is called the explicit index. When defining an index, the number of index labels must equal the number of data values.

Example:

In [11]:
age=pd.Series([30,45,78,1,18,89], index=['a', 'b', 'c', 'd', 'e', 'f'])

In [12]:
age

a    30
b    45
c    78
d     1
e    18
f    89
dtype: int64
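The length requirement can be checked with a short sketch: passing fewer labels than values is expected to raise a ValueError. The name mismatch_error is hypothetical and only stores the exception for inspection.

```python
import pandas as pd

# Three values but only two labels: pandas rejects the mismatch
try:
    pd.Series([30, 45, 78], index=['a', 'b'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(mismatch_error)
```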

Access data using the explicit index

In [13]:
age['b']

45

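Even after defining an explicit index, we can still access values by the implicit (positional) index. Note that recent pandas versions deprecate this positional fallback for square brackets; age.iloc[1] is the explicit way to select by position.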

In [14]:
age[1]

45

When the explicit index is made of integers, it can clash with the implicit index. In that case, accessing data with square brackets relies only on the explicit index.

Example:

In [19]:
age2=pd.Series([30,45,78,1,18,89], index=[2,1,6,3,4,5])

In [20]:
age2

2    30
1    45
6    78
3     1
4    18
5    89
dtype: int64

Access using the explicit index.

In [21]:
age2[6]

78

Access using the implicit index.

In [18]:
age2[0]

KeyError: 0

Explanation: this raises an error because, with an integer explicit index, square brackets look up labels only, and there is no label 0.
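One way to avoid the KeyError is to check whether a label exists before looking it up, or to use Series.get, which returns a default instead of raising. This is a minimal sketch on the same age2 data; the result variable names are assumptions.

```python
import pandas as pd

age2 = pd.Series([30, 45, 78, 1, 18, 89], index=[2, 1, 6, 3, 4, 5])

# Membership test against the explicit index labels
has_label_zero = 0 in age2.index
print(has_label_zero)

# get() returns None (or a chosen default) when the label is missing
value_or_none = age2.get(0)
value_for_six = age2.get(6)
print(value_or_none, value_for_six)
```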

#loc and iloc

With loc and iloc you can do practically any data selection operation on a Series or DataFrame. loc is label-based, which means that you specify rows and columns by their labels. iloc is integer-position based, so you specify rows and columns by their integer position.

Example:

In [22]:
age2=pd.Series([30,45,78,1,18,89], index=[2,1,6,3,4,5])

When we access a single value with square brackets, the explicit index is used.

In [23]:
age2[6]

78

Explanation: trying to use the implicit index with square brackets behaves inconsistently, as in the KeyError case above.

loc and iloc are used to resolve this inconsistency.

loc accesses data by the explicit index.

iloc accesses data by the implicit index.

In [24]:
#loc
age2.loc[6]  # select by explicit index

78

In [25]:
age2.loc[2:6]  # slice by explicit index

2    30
1    45
6    78
dtype: int64

In [26]:
#iloc

age2.iloc[3]  # select by implicit index

1

In [27]:
age2.iloc[1:4]  # slice by implicit index

1    45
6    78
3     1
dtype: int64
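The two slices above also differ in how they treat the end point: a loc slice includes the stop label, while an iloc slice excludes the stop position. The sketch below compares the two on the same age2 data; by_label and by_position are illustrative names.

```python
import pandas as pd

age2 = pd.Series([30, 45, 78, 1, 18, 89], index=[2, 1, 6, 3, 4, 5])

# Label slice: runs from label 2 to label 6 in index order, label 6 included
by_label = age2.loc[2:6]

# Position slice: positions 0, 1, 2; position 3 is excluded
by_position = age2.iloc[0:3]

# Both select the same three rows here
print(by_label.equals(by_position))
```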

#Data Frame

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, the data is arranged in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
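The three components can be inspected directly through the index, columns, and to_numpy() members of a DataFrame. The small frame below, named demo, is a hypothetical example built only for this illustration.

```python
import pandas as pd

# A tiny illustrative DataFrame with two rows and two columns
demo = pd.DataFrame(
    {'age': [25, 20], 'location': ['California', 'New York']},
    index=['Sharon', 'Alex'],
)

row_labels = demo.index.tolist()       # the rows
column_labels = demo.columns.tolist()  # the columns
data = demo.to_numpy()                 # the data as a 2-D array

print(row_labels)
print(column_labels)
print(data.shape)
```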

In [28]:
dict_age={'Sharon': 25,
          'Alex':20,
          'Kate':35,
          'Justin':27,
          'Benx':49}
dict_age

{'Alex': 20, 'Benx': 49, 'Justin': 27, 'Kate': 35, 'Sharon': 25}

In [29]:
ages=pd.Series(dict_age)

In [33]:
ages

Sharon    25
Alex      20
Kate      35
Justin    27
Benx      49
dtype: int64

In [34]:
ages.loc['Kate']

35

In [35]:
ages.iloc[2]

35

In [36]:
dict_location={'Sharon': 'California',
               'Alex': 'New York',
               'Kate': 'Los Angeles',
               'Justin': 'Canada',
               'Benx': 'Washington'}
location=pd.Series(dict_location)

In [37]:
location

Sharon     California
Alex         New York
Kate      Los Angeles
Justin         Canada
Benx       Washington
dtype: object

In [38]:
people=pd.DataFrame({'age':ages, 'location':location})

In [39]:
people

        age     location
Sharon   25   California
Alex     20     New York
Kate     35  Los Angeles
Justin   27       Canada
Benx     49   Washington
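On a DataFrame, loc and iloc accept a row selector followed by a column selector. The sketch below rebuilds the same table from plain lists so that it runs on its own; the result variable names are assumptions.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [25, 20, 35, 27, 49],
     'location': ['California', 'New York', 'Los Angeles',
                  'Canada', 'Washington']},
    index=['Sharon', 'Alex', 'Kate', 'Justin', 'Benx'],
)

# Single cell by row label and column label
kate_location = people.loc['Kate', 'location']

# The same cell by row position and column position
same_cell = people.iloc[2, 1]

# Label slice of rows (end label included) with a list of columns
subset = people.loc['Alex':'Justin', ['age']]

print(kate_location, same_cell)
print(subset)
```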


Add a new column.

Syntax for adding a column:

In [40]:
people['Weight']=people['age']*1.5

Displaying the results

In [41]:
people

        age     location  Weight
Sharon   25   California    37.5
Alex     20     New York    30.0
Kate     35  Los Angeles    52.5
Justin   27       Canada    40.5
Benx     49   Washington    73.5
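Assigning to people['Weight'] changes the DataFrame in place. When the original should stay untouched, DataFrame.assign returns a new DataFrame with the extra column instead. This is a minimal sketch on a two-row copy of the data; people_with_weight is a hypothetical name, and the age * 1.5 formula is simply the one used above.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [25, 20], 'location': ['California', 'New York']},
    index=['Sharon', 'Alex'],
)

# assign() returns a copy with the new column; people itself is unchanged
people_with_weight = people.assign(Weight=people['age'] * 1.5)

print(people.columns.tolist())
print(people_with_weight)
```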


Add a new row.

Syntax for adding a row:

In [60]:
people_add=pd.DataFrame({'Claire':[57,'Colorado',56]})

Displaying the results

In [61]:
people_add

     Claire
0        57
1  Colorado
2        56


In [62]:
people_add=people_add.T

In [63]:
people_add

         0         1   2
Claire  57  Colorado  56


In [64]:
people_add.columns=people.columns

Display the result with the renamed columns

In [65]:
people_add

        age  location  Weight
Claire   57  Colorado      56


Combine with the existing data

In [66]:
pd.concat([people,people_add])

        age     location  Weight
Sharon   25   California    37.5
Alex     20     New York    30.0
Kate     35  Los Angeles    52.5
Justin   27       Canada    40.5
Benx     49   Washington    73.5
Claire   57     Colorado    56.0
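The transpose route above stores mixed values in a single column first, which typically leaves the new row with object dtype. An alternative sketch is to build the new row with named columns and an explicit index, so each column keeps its numeric type after pd.concat. The two-row people frame and the names new_row and combined are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [25, 20],
     'location': ['California', 'New York'],
     'Weight': [37.5, 30.0]},
    index=['Sharon', 'Alex'],
)

# One-row DataFrame whose columns already match people
new_row = pd.DataFrame(
    {'age': [57], 'location': ['Colorado'], 'Weight': [56.0]},
    index=['Claire'],
)

combined = pd.concat([people, new_row])

print(combined)
print(combined.dtypes)
```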
