
# Pandas - Series

A Series is a one-dimensional Pandas data structure similar to a NumPy array. Unlike a NumPy array, which is normally used for numeric data, a Series can hold values of any type, and its elements can be accessed by label as well as by position. The examples below show what this means.


In [1]:
import numpy as np
import pandas as pd
import math

### Creating a Series from other objects such as a list, an array or a dictionary


In [2]:
index1 = ['i1','i2','i3','i4']
columns1 = ['c1','c2','c3','c4']

In [3]:
list1 = list(range(0,12,3))
print(len(list1))
s1 = pd.Series(list1)  # if no index argument is given, pandas creates a default integer index
print(type(s1))
s1

4
<class 'pandas.core.series.Series'>


0    0
1    3
2    6
3    9
dtype: int64

In [4]:
pd.Series(data=list1, index= index1)  # specifying the index

i1    0
i2    3
i3    6
i4    9
dtype: int64

In [5]:
array1 = np.array(range(0,8,2))
print(len(array1))
s1 = pd.Series(array1, index=index1)
s1

4


i1    0
i2    2
i3    4
i4    6
dtype: int32

In [6]:
dict1 = {'k1':math.pi,'k2':10,'k3':math.e, 'k4':'Hi'}
print(len(dict1))
s1 = pd.Series(dict1)
s1

4


k1    3.14159
k2         10
k3    2.71828
k4         Hi
dtype: object

## Indexing


In [7]:
index1 = ['i1','i2','i3','i4']

dict1 = {'k1':math.pi,'k2':10,'k3':math.e, 'k4':'Hi'}
s1 = pd.Series(dict1)  # the dictionary keys become the index
array1 = np.array(range(0,8,2))
s2 = pd.Series(array1, index=index1)

In [8]:
s1

k1    3.14159
k2         10
k3    2.71828
k4         Hi
dtype: object

In [9]:
s2                     

i1    0
i2    2
i3    4
i4    6
dtype: int32

In [10]:
s1['k1']

3.141592653589793
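
The label lookup above is one of several ways to index a Series. The following is a minimal sketch, assuming the same dictionary as before (the variable names `s`, `by_label` and so on are only illustrative), that contrasts label-based access with `.loc` and position-based access with `.iloc`.

```python
import math
import pandas as pd

s = pd.Series({'k1': math.pi, 'k2': 10, 'k3': math.e, 'k4': 'Hi'})

by_label = s.loc['k2']          # label-based lookup
by_position = s.iloc[1]         # position-based lookup (second element)
several = s.loc[['k1', 'k4']]   # a list of labels returns a Series
label_slice = s.loc['k1':'k3']  # label slices include the end label

print(by_label, by_position)
print(list(several.index))
print(list(label_slice.index))
```

Note that a label slice includes its end label, whereas a positional slice such as `s.iloc[0:2]` excludes its upper bound.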

# Pandas DataFrames
A DataFrame is a collection of Series objects that share a common index; each column is a Series.


In [11]:
data = np.random.randn(5,4)
# without a seed, you will get different values every time you run this cell
data

array([[-0.95901707, -0.50974631,  0.36528983,  0.78773318],
       [-1.84657492, -0.91122521, -0.70213615,  0.55589374],
       [ 1.34021825,  0.15806095,  1.02692674,  0.58425484],
       [-1.27517679, -1.48881001,  0.12095165,  0.23353478],
       [-1.24054756, -2.13635095, -0.50062258,  1.62229752]])

To make your results comparable with those of your fellow students, or with your own earlier runs, you can fix the random number generator with np.random.seed() as shown below.

In [12]:
columns1 = ['c1','c2','c3','c4']
index1 = ['i1','i2','i3','i4']
np.random.seed(0)  # this will fix your random numbers. Not so much random now!!
data = np.random.randn(4,4)
df = pd.DataFrame(data= data, index= index1, columns= columns1)
df

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674
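
The effect of the seed can be checked directly. This small sketch (the names `first`, `second` and `third` are illustrative) shows that resetting the seed reproduces the same numbers, while drawing again without a reset continues the sequence.

```python
import numpy as np

np.random.seed(0)
first = np.random.randn(4, 4)

np.random.seed(0)               # resetting the seed restarts the sequence
second = np.random.randn(4, 4)

third = np.random.randn(4, 4)   # no reset: the sequence simply continues

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```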


# Indexing and Slicing a DataFrame
i.e. selecting rows, columns or combination

In [13]:
# select the first column
# df[c1] raises a NameError, because c1 (without quotation marks) is not a defined variable

In [14]:
# pass the column label as a string instead
df['c1']

i1    1.764052
i2    1.867558
i3   -0.103219
i4    0.761038
Name: c1, dtype: float64

In [15]:
df[['c1', 'c2']]  # pay attention to the comma, the quotation marks and the double brackets

Unnamed: 0,c1,c2
i1,1.764052,0.400157
i2,1.867558,-0.977278
i3,-0.103219,0.410599
i4,0.761038,0.121675


In [16]:
df[columns1]

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [17]:
df[columns1[0:3]]

Unnamed: 0,c1,c2,c3
i1,1.764052,0.400157,0.978738
i2,1.867558,-0.977278,0.950088
i3,-0.103219,0.410599,0.144044
i4,0.761038,0.121675,0.443863
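
As a possible alternative to passing a list of column names, `.loc` accepts a label slice over the columns. The sketch below rebuilds the same seeded DataFrame and compares the two approaches; it also shows that a label which does not exist raises a `KeyError` (the label `'c9'` is a made-up example).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])

by_list = df[['c1', 'c2', 'c3']]   # explicit list of column labels
by_slice = df.loc[:, 'c1':'c3']    # label slice, end label included

print(by_list.equals(by_slice))    # True

try:
    df['c9']                       # this column does not exist
except KeyError as err:
    print('KeyError:', err)
```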


As said earlier, each column of a DataFrame is a Series object. Try the following:

In [18]:
type(df['c1'])

pandas.core.series.Series

In [19]:
columns1 = ['c1','c2','c3','c4']
index1 = ['i1','i2','i3','i4']
np.random.seed(0)  # this will fix your random numbers. Not so much random now!!
data = np.random.randn(4,4)
df = pd.DataFrame(data= data, index= index1, columns= columns1)
df

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


Selecting by rows

In [20]:
df.loc['i1']

c1    1.764052
c2    0.400157
c3    0.978738
c4    2.240893
Name: i1, dtype: float64

In [21]:
df.loc[['i1', 'i2']]  # pay attention to the extra brackets

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357


In [22]:
df.iloc[0:3]  # using row position numbers. Note the upper limit is not included

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274


Selecting by rows and columns using integer positions

In [23]:
df.iloc[0:2, 1:3]  

Unnamed: 0,c2,c3
i1,0.400157,0.978738
i2,-0.977278,0.950088
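
The same block of cells can also be selected by label rather than by position. This is a minimal sketch on the same seeded DataFrame, showing that `.loc` with row and column labels matches the `.iloc` call above, and that a single cell can be read with one row label and one column label.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])

by_position = df.iloc[0:2, 1:3]                  # rows 0-1, columns 1-2
by_label = df.loc[['i1', 'i2'], ['c2', 'c3']]    # the same cells, by label

print(by_position.equals(by_label))              # True

single_cell = df.loc['i2', 'c3']                 # one row label, one column label
print(round(single_cell, 6))
```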


Conditional selection: values that do not satisfy the condition are replaced with NaN

In [24]:
df[df>=0]

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,,0.950088,
i3,,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674
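
A condition on the whole DataFrame masks individual cells. A condition on a single column, by contrast, keeps or drops entire rows. The sketch below (variable names are illustrative) filters rows on one column and then on two columns combined with `&`.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])

# Keep only the rows where c1 is positive
positive_c1 = df[df['c1'] > 0]

# Combine conditions with & (and) or | (or); each condition needs parentheses
positive_c1_and_c2 = df[(df['c1'] > 0) & (df['c2'] > 0)]

print(list(positive_c1.index))
print(list(positive_c1_and_c2.index))
```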


In [25]:
df[df>(df['c2'].max())]


Unnamed: 0,c1,c2,c3,c4
i1,1.764052,,0.978738,2.240893
i2,1.867558,,0.950088,
i3,,,,1.454274
i4,0.761038,,0.443863,


In [26]:
df.reset_index(drop=True, inplace=False)

Unnamed: 0,c1,c2,c3,c4
0,1.764052,0.400157,0.978738,2.240893
1,1.867558,-0.977278,0.950088,-0.151357
2,-0.103219,0.410599,0.144044,1.454274
3,0.761038,0.121675,0.443863,0.333674


drop=True discards the original index instead of keeping it as a new column, and inplace=True modifies the DataFrame itself instead of returning a new one (here inplace=False, so df is unchanged),
so be careful what you do to your data!
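
The two arguments can be explored on a small example. This sketch, which uses a made-up two-row DataFrame, shows that without `inplace=True` the original keeps its index, and that `drop=False` moves the old labels into a column (named `index` when the index has no name).

```python
import pandas as pd

df = pd.DataFrame({'c1': [1.0, 2.0]}, index=['i1', 'i2'])

renumbered = df.reset_index(drop=True)   # returns a new DataFrame
kept = df.reset_index(drop=False)        # old labels become a column

print(list(df.index))          # the original is untouched
print(list(renumbered.index))
print(list(kept.columns))
```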

# Exercise
Create a new set of index labels and replace the current index column with your new index column.
Hint: use the .set_index() method.

# Data Engineering
We will use the sigmoid function to create a new column from an existing column.

Wikipedia:
"Many natural processes, such as those of complex system learning curves, exhibit a progression from small beginnings that accelerates and approaches a climax over time. When a specific mathematical model is lacking, a sigmoid function is often used."

First, let's start with something simple.

In [27]:
df['c5'] = df['c4']*2
df

Unnamed: 0,c1,c2,c3,c4,c5
i1,1.764052,0.400157,0.978738,2.240893,4.481786
i2,1.867558,-0.977278,0.950088,-0.151357,-0.302714
i3,-0.103219,0.410599,0.144044,1.454274,2.908547
i4,0.761038,0.121675,0.443863,0.333674,0.667349


Now drop the new column for good using the argument inplace=True

In [28]:
df.drop('c5', axis= 1,inplace=True)
df

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


Create the function

In [29]:
# first create a function
def func1(argument):
    out = 1 / (1 + np.exp(-argument))  # Sigmoid function
    return out

In [30]:
func1(2) # test it

0.88079707797788231

Using apply() with a lambda to apply our function to every value in a column

In [31]:
df.columns

Index(['c1', 'c2', 'c3', 'c4'], dtype='object')

In [32]:
df['Eng_c4'] = df['c4'].apply(lambda x: func1(x))

In [33]:
df

Unnamed: 0,c1,c2,c3,c4,Eng_c4
i1,1.764052,0.400157,0.978738,2.240893,0.903862
i2,1.867558,-0.977278,0.950088,-0.151357,0.462233
i3,-0.103219,0.410599,0.144044,1.454274,0.810655
i4,0.761038,0.121675,0.443863,0.333674,0.582653
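
Because the sigmoid here is built from NumPy operations, the lambda wrapper is probably not strictly necessary. The sketch below (the name `sigmoid` is a stand-in for `func1`) compares three ways of getting the same column: `apply` with a lambda, `apply` with the function itself, and calling the function on the whole Series at once.

```python
import numpy as np
import pandas as pd

def sigmoid(x):
    """Logistic sigmoid; works on scalars, arrays and Series."""
    return 1 / (1 + np.exp(-x))

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])

via_lambda = df['c4'].apply(lambda x: sigmoid(x))  # element by element
via_function = df['c4'].apply(sigmoid)             # same, without the lambda
vectorised = sigmoid(df['c4'])                     # whole column at once

print(np.allclose(via_lambda, via_function))  # True
print(np.allclose(via_lambda, vectorised))    # True
```

The vectorised call is usually the fastest of the three on large columns, since it avoids a Python-level loop.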


In [34]:
df['Eng2'] = 2*df['c1'] + df['Eng_c4']
df

Unnamed: 0,c1,c2,c3,c4,Eng_c4,Eng2
i1,1.764052,0.400157,0.978738,2.240893,0.903862,4.431967
i2,1.867558,-0.977278,0.950088,-0.151357,0.462233,4.197349
i3,-0.103219,0.410599,0.144044,1.454274,0.810655,0.604218
i4,0.761038,0.121675,0.443863,0.333674,0.582653,2.104729


# Cleaning NaN values from your DataFrame

Let's recreate our old DataFrame, then build a new DataFrame from it that contains some NaN values.


In [35]:
columns1 = ['c1','c2','c3','c4']
index1 = ['i1','i2','i3','i4']
np.random.seed(0)  # this will fix your random numbers. Not so much random now!!
data = np.random.randn(4,4)
df = pd.DataFrame(data= data, index= index1, columns= columns1)
df

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [37]:
df2 = df[df>(df['c2'].mean())]
df2

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,,0.950088,
i3,,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


First, let's use a few methods to check whether a DataFrame that someone gave us contains any NaN values.

In [38]:
df.isnull().values.any()  

False

In [39]:
df2.isnull().values.any()  

True

In [40]:
df2.isnull().any()

c1     True
c2     True
c3    False
c4     True
dtype: bool

In [41]:
sum(df2.isnull().any())

3

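So three of the four columns in df2 contain at least one NaN value. Note that this expression counts columns, not individual NaN values; df2.isnull().sum().sum() counts the values themselves (also three in this case).
You can either remove the rows or columns that contain NaN, or replace the NaN values with another value.
The decision depends on your knowledge of the dataset and your authority to make changes.
In any case, you must record your changes and report them to the data provider for further investigation.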
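
Before deciding how to clean the data, it can help to count the missing values in several ways. This is a small sketch on the same seeded data (variable names are illustrative) that counts NaN values per column, in total, and per affected row or column.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])
df2 = df[df > df['c2'].mean()]   # values at or below the mean of c2 become NaN

per_column = df2.isnull().sum()                       # NaN count in each column
total_nan = int(df2.isnull().sum().sum())             # NaN count overall
columns_with_nan = int(df2.isnull().any().sum())      # columns with at least one NaN
rows_with_nan = int(df2.isnull().any(axis=1).sum())   # rows with at least one NaN

print(per_column)
print(total_nan, columns_with_nan, rows_with_nan)
```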

# Remove columns with NaN

WARNING: if you use inplace=True your original DataFrame will be changed!

In [42]:
df_clean_cols = df2.dropna(axis=1, inplace=False)
df_clean_cols

Unnamed: 0,c3
i1,0.978738
i2,0.950088
i3,0.144044
i4,0.443863


Remove rows with NaN

In [43]:
df_clean_rows = df2.dropna(axis=0, inplace=False)
df_clean_rows

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i4,0.761038,0.121675,0.443863,0.333674
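
Dropping every row or column that contains a NaN can discard a lot of data. `dropna` also takes `subset` and `thresh` arguments for finer control. The sketch below, on the same seeded data, drops rows only when `c1` is missing, and then keeps rows that have at least three non-missing values.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])
df2 = df[df > df['c2'].mean()]

# Drop a row only if its value in column c1 is missing
needs_c1 = df2.dropna(subset=['c1'])

# Keep rows that have at least 3 non-NaN values
mostly_complete = df2.dropna(thresh=3)

print(list(needs_c1.index))
print(list(mostly_complete.index))
```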


# Replacing NaN values

In [44]:
columns1 = ['c1','c2','c3','c4']
index1 = ['i1','i2','i3','i4']
np.random.seed(0)  # this will fix your random numbers. Not so much random now!!
data = np.random.randn(4,4)
df = pd.DataFrame(data= data, index= index1, columns= columns1)
df2 = df[df>(df['c2'].mean())]
df2


Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,,0.950088,
i3,,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [45]:
df2.fillna(value=0, inplace=False)

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,0.0,0.950088,0.0
i3,0.0,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [46]:
df2

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,,0.950088,
i3,,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [47]:
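df2.loc['i2'] = df2.loc['i2'].fillna(value=df2.loc['i1'].mean())
# assigning the result back avoids calling fillna(inplace=True) on a temporary copy of the row,
# which may leave df2 unchanged in recent pandas versions
# we only replaced NaNs in row i2, using the mean of row i1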

In [48]:
df2

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,1.34596,0.950088,1.34596
i3,,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674
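
A common variation is to fill each NaN with the mean of its own column. The following sketch assumes that passing a Series of column means to `fillna` is an acceptable strategy for your data; whether it is depends on the dataset.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['i1', 'i2', 'i3', 'i4'],
                  columns=['c1', 'c2', 'c3', 'c4'])
df2 = df[df > df['c2'].mean()]

column_means = df2.mean()            # NaN values are skipped by default
filled = df2.fillna(column_means)    # each column is filled with its own mean

print(filled)
print(int(df2.isnull().sum().sum()))  # df2 itself is unchanged
```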


# Concatenating two or more DataFrames
Suppose you are given two DataFrames that you want to analyse together. You can concatenate them into a single DataFrame.

In [49]:
columns1 = ['c1','c2','c3','c4']
index1 = ['i1','i2','i3','i4']
np.random.seed(0)  # this will fix your random numbers. Not so much random now!!
data = np.random.randn(4,4)
df1 = pd.DataFrame(data= data, index= index1, columns= columns1)
df1

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674


In [50]:
columns1 = ['c1','c2','c3','c4']
index2 = ['r1','r2','r3','r4']
np.random.seed(1)  # this will fix your random numbers. Not so much random now!!
data = np.random.randint(low=0, high=10, size=(4,4))
df2 = pd.DataFrame(data= data, index=index2, columns= columns1)
df2

Unnamed: 0,c1,c2,c3,c4
r1,5,8,9,5
r2,0,0,1,7
r3,6,9,2,4
r4,5,2,4,2


In [51]:
df3 = pd.concat([df1,df2], ignore_index=False)
df3

Unnamed: 0,c1,c2,c3,c4
i1,1.764052,0.400157,0.978738,2.240893
i2,1.867558,-0.977278,0.950088,-0.151357
i3,-0.103219,0.410599,0.144044,1.454274
i4,0.761038,0.121675,0.443863,0.333674
r1,5.0,8.0,9.0,5.0
r2,0.0,0.0,1.0,7.0
r3,6.0,9.0,2.0,4.0
r4,5.0,2.0,4.0,2.0


In [52]:
df3 = pd.concat([df1,df2], ignore_index=True)
df3

Unnamed: 0,c1,c2,c3,c4
0,1.764052,0.400157,0.978738,2.240893
1,1.867558,-0.977278,0.950088,-0.151357
2,-0.103219,0.410599,0.144044,1.454274
3,0.761038,0.121675,0.443863,0.333674
4,5.0,8.0,9.0,5.0
5,0.0,0.0,1.0,7.0
6,6.0,9.0,2.0,4.0
7,5.0,2.0,4.0,2.0
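
`pd.concat` has a few more options worth knowing. This sketch uses two small made-up DataFrames: the `keys` argument labels which rows came from which source, and `axis=1` places the DataFrames side by side, aligning on the index and filling the gaps with NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [1.5, 2.5], 'c2': [3.5, 4.5]}, index=['i1', 'i2'])
df2 = pd.DataFrame({'c1': [5, 0], 'c2': [8, 0]}, index=['r1', 'r2'])

# Label the origin of each block of rows with an outer index level
stacked = pd.concat([df1, df2], keys=['first', 'second'])
from_second = stacked.loc['second']

# Place the DataFrames side by side; rows are aligned on the index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

Because the two DataFrames share no index labels, every row of `side_by_side` has missing values for the DataFrame it did not come from.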


# Read and write datasets into pandas DataFrame

In [53]:
file_name = 'Fake_Data.csv'
df = pd.read_csv(file_name, sep=',')  # pay attention to the file extension, separator and other arguments
df.head(5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,California,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,California,166187.94
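
To see how the `sep` argument matters without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The data values and the semicolon separator are invented for illustration and are not taken from Fake_Data.csv.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, not the real Fake_Data.csv
csv_text = (
    "R&D Spend;State;Profit\n"
    "100.5;New York;10.0\n"
    "200.0;California;20.5\n"
)

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# usecols restricts the read to the listed columns
subset = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['State', 'Profit'])

print(df)
print(list(subset.columns))
```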


First, make a copy of your original DataFrame so that the original stays untouched.

In [54]:
df_eng = df.copy(deep=True)

In [55]:
df.head(2)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06


In [56]:
df_eng.head(2)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06


In [57]:
# first create a function
def func1(argument):
    out = 1 / (1 + np.sqrt(np.log10(argument)))  # a log-based transform (not a sigmoid); requires argument >= 1
    return out

In [58]:
func1(2) # test it

0.64571868930122911

Using apply() with a lambda to apply our function to every value in a column and engineer a new column

In [59]:
df_eng.columns

Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

In [60]:
df_eng['Eng1'] = df_eng['Profit'].apply(lambda x: func1(x))

In [61]:
df_eng['Eng2'] = df_eng['R&D Spend'] + df_eng['Administration']

In [62]:
df_eng.head(5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit,Eng1,Eng2
0,165349.2,136897.8,471784.1,New York,192261.83,0.303152,302247.0
1,162597.7,151377.59,443898.53,California,191792.06,0.303174,313975.29
2,153441.51,101145.55,407934.54,California,191050.39,0.303207,254587.06
3,144372.41,118671.85,383199.62,New York,182901.99,0.303587,263044.26
4,142107.34,91391.77,366168.42,California,166187.94,0.304427,233499.11


In [63]:
df_eng.to_csv('New_Eng_DataFrame.csv',index=False)
df0 = pd.read_csv('New_Eng_DataFrame.csv')
df0.head(10)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit,Eng1,Eng2
0,165349.2,136897.8,471784.1,New York,192261.83,0.303152,302247.0
1,162597.7,151377.59,443898.53,California,191792.06,0.303174,313975.29
2,153441.51,101145.55,407934.54,California,191050.39,0.303207,254587.06
3,144372.41,118671.85,383199.62,New York,182901.99,0.303587,263044.26
4,142107.34,91391.77,366168.42,California,166187.94,0.304427,233499.11
5,131876.9,99814.71,362861.36,New York,156991.12,0.30493,231691.61
6,134615.46,147198.87,127716.82,California,156122.51,0.304979,281814.33
7,130298.13,145530.06,323876.68,New York,155752.6,0.305,275828.19
8,120542.52,148718.95,311613.29,New York,152211.77,0.305204,269261.47
9,123334.88,108679.17,304981.62,California,149759.96,0.305348,232014.05
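
The `index=False` argument used when writing decides whether the row labels are stored in the file. The sketch below uses a tiny made-up DataFrame and in-memory text instead of a file to show what happens with and without it, and how `index_col=0` restores a saved index on reading.

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

with_index = df.to_csv()                 # no path given: returns the CSV text
without_index = df.to_csv(index=False)   # row labels are not written

# Reading back the text that includes the index gives an extra unnamed column
naive = pd.read_csv(io.StringIO(with_index))

# index_col=0 turns the first column back into the index
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

# The labels x and y are lost here; a default integer index is created
plain = pd.read_csv(io.StringIO(without_index))

print(list(naive.columns))
print(list(restored.index))
print(list(plain.index))
```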
