# Intro
In this tutorial, we'll learn how to use the numpy and pandas libraries for data manipulation from scratch. Instead of going into theory, we'll take a practical approach.

Main Goals:
1. 6 Important things you should know about Numpy and Pandas
2. Starting with Numpy
3. Starting with Pandas
4. Exploring an ML Data Set
5. Building a Random Forest Model

1. The data manipulation capabilities of pandas are built on top of the numpy library; numpy is a dependency of pandas.
2. Pandas is best at handling tabular data sets comprising different variable types (integer, float, string, etc.). It can be used for simple tasks such as loading data as well as more involved ones such as feature engineering on time series data.
3. Numpy is most suitable for performing basic numerical computations such as mean, median, range, etc. It also supports the creation of multi-dimensional arrays.
4. The numpy library can also be used to integrate C/C++ and Fortran code.
5. Remember, Python uses zero-based indexing, unlike R, where indexing starts at one.
6. The best part of learning pandas and numpy is the strong active community support you'll get from around the world.
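
Points 1 and 5 can be checked directly. The following is a minimal sketch, assuming both libraries are installed; the Series `s` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([10, 20, 30])

# .values exposes the data of the Series as a numpy array
underlying = s.values
print(type(underlying))

# Zero-based indexing: position 0 holds the first element
first_from_pandas = s.iloc[0]
first_from_numpy = underlying[0]
print(first_from_pandas, first_from_numpy)
```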

# Starting with Numpy


In [1]:
import numpy as np
np.__version__


'1.13.3'

In [2]:
#Create a list comprising the numbers from 0 to 9
L=list(range(10))

In [3]:
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]:
#converting integers to string - this style of handling 
#lists is known as list comprehension.
#List comprehension offers a versatile 
#way to handle list manipulations tasks easily. 
#We'll learn about them in future tutorials. Here's an example.  

[str(c) for c in L]

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [5]:
[type(item) for item in L]

[int, int, int, int, int, int, int, int, int, int]

#### Creating Arrays
Numpy arrays are homogeneous, i.e., every element of an array has the same data type (integer, float, etc.), unlike lists.
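
To see what homogeneous means in practice, here is a small sketch (the sample values are made up): when a list with mixed types is converted, numpy picks one common type for the whole array.

```python
import numpy as np

# A list can hold mixed types; numpy converts them to one common type
mixed = [1, 2.5, 3]
arr = np.array(mixed)
print(arr.dtype)            # float64: the integers were upcast to floats

# Mixing numbers and text turns every element into a string
text_arr = np.array([1, 'a'])
print(text_arr.dtype.kind)  # 'U' stands for a unicode string type
```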

In [6]:
np.zeros(10,dtype="int")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]:
np.ones((3,5),dtype="float")

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [8]:
#Creating arrays of predefined values 
np.full((3,5),1.23)

array([[ 1.23,  1.23,  1.23,  1.23,  1.23],
       [ 1.23,  1.23,  1.23,  1.23,  1.23],
       [ 1.23,  1.23,  1.23,  1.23,  1.23]])

In [9]:
#Create an array with a set sequence
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [10]:
#Create an array with even space between the given range of values 
np.linspace(0,1,5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [11]:
#Create a 3 by 3 array of samples drawn from a normal distribution with mean 0 and standard deviation 1
np.random.normal(0,1,(3,3))

array([[ 0.68187925,  2.18691082, -0.73187111],
       [-0.5263109 , -1.37857471, -0.20828398],
       [ 1.07799132, -0.17898609, -0.86364607]])

In [12]:
# Create an identity matrix
np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [15]:
#Set a random seed
np.random.seed(0)

x1=np.random.randint(10,size=6)
x2=np.random.randint(10,size=(3,4))
x3=np.random.randint(10,size=(3,4,5))

In [16]:
x1

array([5, 0, 3, 3, 7, 9])

In [17]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [18]:
print("x3 ndim", x3.ndim)
print("x3 shape:",x3.shape)
print("x3 size:",x3.size)

x3 ndim 3
x3 shape: (3, 4, 5)
x3 size: 60


### Array Indexing


In [19]:
#3rd row and 4th column value
x2[2,3]

7

In [20]:
x2[1,0]

7

## Array Slicing
Slicing offers multiple ways of accessing a range of values from an array.

In [21]:
x=np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [22]:
x[:5]

array([0, 1, 2, 3, 4])

In [23]:
x[4:]

array([4, 5, 6, 7, 8, 9])

In [24]:
x[4:7]

array([4, 5, 6])

In [26]:
#return elements at even indices (0, 2, 4, ...)
x[::2]

array([0, 2, 4, 6, 8])

In [27]:
#return every second element, starting from index 1
x[1::2]

array([1, 3, 5, 7, 9])

In [28]:
#reverse the array
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
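
One detail worth knowing when slicing: a numpy slice is a view on the original array, not a copy. The sketch below illustrates this with a fresh array (the values 99 and -1 are arbitrary).

```python
import numpy as np

x = np.arange(10)

# A slice is a view: it shares memory with the original array
view = x[:3]
view[0] = 99
print(x[0])   # the original array changed too

# Use copy() when an independent array is needed
independent = x[:3].copy()
independent[1] = -1
print(x[1])   # the original array is untouched this time
```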

### Array Concatenation
We often need to combine different arrays. Instead of typing each of their elements manually, you can use array concatenation to handle such tasks easily.

In [30]:
x=np.array([1,2,3])
y=np.array([3,2,1])
z=[21,21,21]
np.concatenate([x,y,z])

array([ 1,  2,  3,  3,  2,  1, 21, 21, 21])

In [32]:
grid=np.array([[1,2,3],[4,5,6]])
np.concatenate([grid,grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [35]:
#The axis parameter sets the direction of concatenation:
#axis=0 (the default) stacks row-wise, axis=1 joins column-wise
np.concatenate([grid,grid],axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

Until now, we concatenated arrays with the same number of dimensions. But what if you need to combine a 2D array with a 1D array? In such situations, np.concatenate might not be the best option. Instead, you can use np.vstack or np.hstack to do the task. Let's see how!

In [36]:
x=np.array([3,4,5])
#grid is the 2 by 3 array defined above: [[1,2,3],[4,5,6]]
np.vstack([x,grid])

array([[3, 4, 5],
       [1, 2, 3],
       [4, 5, 6]])

In [37]:
z=np.array([[9],[9]])
np.hstack([grid,z])

array([[1, 2, 3, 9],
       [4, 5, 6, 9]])
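
The difference between the two approaches can be sketched as follows. With the same shapes as above, np.concatenate refuses to combine a 1D and a 2D array, while np.vstack first treats the 1D array as a single row.

```python
import numpy as np

x = np.array([3, 4, 5])                    # 1D, shape (3,)
grid = np.array([[1, 2, 3], [4, 5, 6]])    # 2D, shape (2, 3)

# np.concatenate requires all inputs to have the same number of dimensions
try:
    np.concatenate([x, grid])
    failed = False
except ValueError:
    failed = True
print(failed)

# np.vstack promotes the 1D array to a single row before stacking
stacked = np.vstack([x, grid])
print(stacked.shape)
```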

Splitting is the opposite of concatenation: np.split divides an array at predefined positions.

In [38]:
x=np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [39]:
x1,x2,x3=np.split(x,[3,6])
print(x1,x2,x3)

[0 1 2] [3 4 5] [6 7 8 9]


In [41]:
t1,t2,t3,t4=np.split(x,[2,5,7])
print(t1,t2,t3,t4)

[0 1] [2 3 4] [5 6] [7 8 9]


In [42]:
np.hstack([t1,t2,t3,t4])

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [45]:
grid=np.arange(16).reshape((4,4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [47]:
upper,lower=np.vsplit(grid,[2])
print(upper,lower)

[[0 1 2 3]
 [4 5 6 7]] [[ 8  9 10 11]
 [12 13 14 15]]


In addition to the functions we learned above, there are several other mathematical functions available in the numpy library, such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, max, etc., which can be used to perform basic arithmetic calculations. Feel free to refer to the numpy documentation for more information on such functions.
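
A few of these functions in action, as a minimal sketch on a made-up array `a`:

```python
import numpy as np

# Hypothetical sample values
a = np.array([1, -2, 3, -4])

total = np.sum(a)                  # sum of all elements
absolute = np.abs(a)               # element-wise absolute value
squares = np.power(a, 2)           # element-wise power
remainders = np.mod(absolute, 2)   # element-wise remainder after division by 2
average = np.mean(absolute)        # arithmetic mean
print(total, absolute, squares, remainders, average)
```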

## Let's start with pandas

In [48]:
import pandas as pd
#create a data frame - dictionary is used here where keys get
#converted to column names and values to row values.
data=pd.DataFrame({'Country': 
                   ['Russia','Colombia','Chile','Equador','Nigeria'],
                    'Rank':[121,40,100,130,11]})
data

    Country  Rank
0    Russia   121
1  Colombia    40
2     Chile   100
3   Equador   130
4   Nigeria    11


In [49]:
#Quick analysis of the dataset
data.describe()

             Rank
count    5.000000
mean    80.400000
std     52.300096
min     11.000000
25%     40.000000
50%    100.000000
75%    121.000000
max    130.000000


Remember, the describe() method computes summary statistics of numeric (integer / float) variables only. To get complete information about the data set, including column types and non-null counts, we can use the info() function.

In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Country    5 non-null object
Rank       5 non-null int64
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes
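
If summary statistics for the non-numeric columns are wanted as well, describe() accepts an include argument. A short sketch on the same country data:

```python
import pandas as pd

data = pd.DataFrame({'Country': ['Russia', 'Colombia', 'Chile', 'Equador', 'Nigeria'],
                     'Rank': [121, 40, 100, 130, 11]})

# By default only numeric columns are summarized
numeric_only = data.describe()

# include='all' adds rows such as unique, top and freq for text columns
everything = data.describe(include='all')
print(list(numeric_only.columns))
print(list(everything.columns))
```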


In [51]:
data=pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                   'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

  group  ounces
0     a     4.0
1     a     3.0
2     a    12.0
3     b     6.0
4     b     7.5
5     b     8.0
6     c     3.0
7     c     5.0
8     c     6.0


In [52]:
#Sorting data frame by ounces - inplace=False returns a sorted copy, while inplace=True would modify data itself
data.sort_values(by=['ounces'],ascending=True,inplace=False)

  group  ounces
1     a     3.0
6     c     3.0
0     a     4.0
7     c     5.0
3     b     6.0
8     c     6.0
4     b     7.5
5     b     8.0
2     a    12.0


In [54]:
#We can sort the data by multiple columns, not just one
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)

  group  ounces
2     a    12.0
0     a     4.0
1     a     3.0
5     b     8.0
4     b     7.5
3     b     6.0
8     c     6.0
7     c     5.0
6     c     3.0


Often, we get data sets with duplicate rows, which add nothing but noise. Therefore, before training a model, we need to make sure we get rid of such inconsistencies in the data set. Let's see how we can remove duplicate rows.

In [55]:
#Creating another data with duplicated rows
data=pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[3,2,1,3,3,4,4]})
data

    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
4  two   3
5  two   4
6  two   4


In [56]:
#Sort values
data.sort_values(by="k2")

    k1  k2
2  one   1
1  one   2
0  one   3
3  two   3
4  two   3
5  two   4
6  two   4


In [57]:
#Remove duplicates
data.drop_duplicates()

    k1  k2
0  one   3
1  one   2
2  one   1
3  two   3
5  two   4


Here, we removed duplicates based on matching row values across all columns. Alternatively, we can also remove duplicates based on a particular column. Let's remove duplicate values from the k1 column.

In [58]:
data.drop_duplicates(subset='k1')

    k1  k2
0  one   3
3  two   3
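
By default, drop_duplicates keeps the first row of every group of duplicates. As a sketch of the alternative, the keep parameter can be set to 'last' to retain the last occurrence instead:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one']*3 + ['two']*4, 'k2': [3, 2, 1, 3, 3, 4, 4]})

# keep='first' is the default; keep='last' retains the last row per k1 value
last_rows = data.drop_duplicates(subset='k1', keep='last')
print(last_rows)
```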


In [93]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Now, we want to create a new variable which indicates the type of animal that is the source of the food. To do that, first we'll create a dictionary that maps each food to its animal. Then, we'll use the map function to look up every food (key) in the dictionary and return its animal (value). Let's see how it is done.

In [94]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}


In [95]:
def meat_2_animal(series):
    if series['food']=='bacon':
        return 'pig'
    elif series['food']=='pulled pork':
        return 'pig'
    elif series['food']=='pastrami':
        return 'cow'
    elif series['food']=='corned beef':
        return 'cow'
    elif series['food']=='honey ham':
        return 'pig'
    else:
        return 'salmon'
#Create a new variable
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


In [96]:
#another way of doing this is to convert the food values to the lower case and 
#apply the function
lower=lambda x:x.lower()
data['food']=data['food'].apply(lower)
data['animal2']=data.apply(meat_2_animal,axis="columns")
data

          food  ounces  animal  animal2
0        bacon     4.0     pig      pig
1  pulled pork     3.0     pig      pig
2        bacon    12.0     pig      pig
3     pastrami     6.0     cow      cow
4  corned beef     7.5     cow      cow
5        bacon     8.0     pig      pig
6     pastrami     3.0     cow      cow
7    honey ham     5.0     pig      pig
8     nova lox     6.0  salmon   salmon


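In [103]:
#Could list comprehension be applied here? Yes, but not with meat_2_animal:
#it expects a row and indexes it with ['food'], so passing the plain strings
#in data['food'] raises "TypeError: string indices must be integers".
#Looking each food up in the dictionary works instead.
data['food']=[food.lower() for food in data['food']]
data['animal2']=[meat_to_animal[food] for food in data['food']]

In [109]:
#Passing the whole column fails as well: series['food'] then looks for the
#label 'food' in the index of the Series, which does not exist
meat_2_animal(data['food'])

KeyError: 'food'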

# Side note: how list comprehension applies to a pandas DataFrame


In [79]:
#Creating an example dataframe (the dictionary is named raw_data so that
#the food data frame stored in data is not overwritten)
raw_data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'year': [2012, 2012, 2013, 2014, 2014],
            'reports': [4, 24, 31, 2, 3]}
df1=pd.DataFrame(raw_data,index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df1

             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014


In [80]:
#Before using list comprehension, we create a new column called next_year
#with a plain for loop to get the idea across
next_year=[]
for row in df1['year']:
    next_year.append(row+1)
df1['next_year']=next_year
df1

             name  reports  year  next_year
Cochice     Jason        4  2012       2013
Pima        Molly       24  2012       2013
Santa Cruz   Tina       31  2013       2014
Maricopa     Jake        2  2014       2015
Yuma          Amy        3  2014       2015


In [84]:
#Using list comprehension to do the same thing
df1['previous year']=[row-1 for row in df1['year']]
df1

             name  reports  year  next_year  previous year
Cochice     Jason        4  2012       2013           2011
Pima        Molly       24  2012       2013           2011
Santa Cruz   Tina       31  2013       2014           2012
Maricopa     Jake        2  2014       2015           2013
Yuma          Amy        3  2014       2015           2013


In [90]:
df1['random col']=[row+row*2 for row in df1['year']]
df1

             name  reports  year  next_year  previous year  random col
Cochice     Jason        4  2012       2013           2011        6036
Pima        Molly       24  2012       2013           2011        6036
Santa Cruz   Tina       31  2013       2014           2012        6039
Maricopa     Jake        2  2014       2015           2013        6042
Yuma          Amy        3  2014       2015           2013        6042
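
The loop and the list comprehension above both visit one row at a time. For simple arithmetic, pandas can also operate on the whole column at once. A minimal sketch, using only the year column from the example:

```python
import pandas as pd

df1 = pd.DataFrame({'year': [2012, 2012, 2013, 2014, 2014]})

# Arithmetic on a column is applied to every row at once
df1['next_year'] = df1['year'] + 1
df1['previous year'] = df1['year'] - 1
print(df1)
```

This gives the same columns as the loop and the list comprehension, with less code.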


# End of the side note on list comprehension for pandas data frames

Another way to create a new variable is by using the assign function. As you keep discovering new functions in this tutorial, you'll realize how powerful pandas is.

In [69]:
data.assign(new_variable=data['ounces']*10)

          food  ounces  animal  animal2  new_variable
0        bacon     4.0     pig      pig          40.0
1  pulled pork     3.0     pig      pig          30.0
2        bacon    12.0     pig      pig         120.0
3     pastrami     6.0     cow      cow          60.0
4  corned beef     7.5     cow      cow          75.0
5        bacon     8.0     pig      pig          80.0
6     pastrami     3.0     cow      cow          30.0
7    honey ham     5.0     pig      pig          50.0
8     nova lox     6.0  salmon   salmon          60.0
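
Note that assign returns a new data frame and leaves the original untouched, which is why new_variable does not appear in the later outputs. A small sketch on a two-row subset of the food data (the `grams` column and the rounded factor of 28.35 grams per ounce are assumptions for illustration):

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'nova lox'], 'ounces': [4.0, 6.0]})

# assign returns a new DataFrame; data itself is left unchanged
# ('grams' is a hypothetical column, 28.35 g per ounce is an approximation)
with_grams = data.assign(grams=data['ounces'] * 28.35)
print('grams' in data.columns)
print('grams' in with_grams.columns)
```

To keep the new column, store the result: `data = data.assign(...)`.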


In [74]:
#Remove the animal2 column from the dataframe
data.drop("animal2",axis='columns',inplace=True)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


We frequently find missing values in our data set. A quick method for imputing them is to fill each missing value with a fixed number. Besides missing values, you may find outliers or placeholder codes (such as -999) in your data set, which might require replacing. Let's see how we can replace values.

In [75]:
#The Series function from pandas creates a one-dimensional labeled array
data=pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [76]:
#Replace -999 with NaN values using the replace function
data.replace(-999,np.nan,inplace=True)
data

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [110]:
#Replacing multiple values at a time
data=pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace([-999,-1000],np.nan,inplace=True)
data

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
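
Once the placeholders are NaN, they can be counted and imputed. The sketch below fills them with the mean of the remaining values; the choice of the mean is an assumption made for illustration, not a recommendation from this tutorial.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 2., np.nan, np.nan, 3.])

# Count the missing entries
n_missing = data.isnull().sum()

# mean() skips NaN by default, so the fill value is the mean of 1, 2 and 3
filled = data.fillna(data.mean())
print(n_missing, filled.tolist())
```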

# Renaming Columns and the Index (Row Names)

In [111]:
data=pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=['Ohio', 'Colorado', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

In [112]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


In [113]:
#Using rename function
data.rename(index={'Ohio':"SanF"},columns=
           {'one':'one_p','two':'two_p'},inplace=True)
data

          one_p  two_p  three  four
SanF          0      1      2     3
Colorado      4      5      6     7
New York      8      9     10    11


In [114]:
#Using String functions to perform this operation
data.rename(index=str.upper,columns=str.title,inplace=True)
data

          One_P  Two_P  Three  Four
SANF          0      1      2     3
COLORADO      4      5      6     7
NEW YORK      8      9     10    11


In [115]:
data.rename(index=str.lower,columns=str.upper,inplace=True)
data

          ONE_P  TWO_P  THREE  FOUR
sanf          0      1      2     3
colorado      4      5      6     7
new york      8      9     10    11


# Using Pandas to categorize (bin) continuous variables

In [116]:
ages=[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [117]:
#Understand the output - '(' means the edge value is excluded from the bin,
#']' means the edge value is included
bins=[18, 25, 35, 60, 100]
cats=pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [118]:
#right=False closes the bins on the left instead: the left edge is included and the right edge is excluded
pd.cut(ages,bins,right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [120]:
#pandas internally assigns an integer code to each category
#(older versions also exposed these codes as cats.labels, which is deprecated)
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [122]:
pd.value_counts(cats) #Number of values per category

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
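
The interval notation can be replaced with readable names through the labels parameter of pd.cut. In the sketch below, the group names are hypothetical and chosen only for illustration; there must be one name per bin.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Hypothetical group names, one per bin
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

named = pd.cut(ages, bins, labels=group_names)
first_four = list(named)[:4]
print(first_four)
```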