# Intro
In this tutorial, we'll learn how to use the numpy and pandas libraries for data manipulation from scratch. Instead of going into theory, we'll take a practical approach.

Main Goals:
1. Six important things you should know about Numpy and Pandas
2. Starting with Numpy
3. Starting with Pandas
4. Exploring an ML Data Set
5. Building a Random Forest Model

Six important things you should know about Numpy and Pandas:
1. The data manipulation capabilities of pandas are built on top of the numpy library. In other words, numpy is a dependency of pandas.
2. Pandas is best at handling tabular data sets comprising different variable types (integer, float, string, etc.). It also covers everyday tasks, from loading data to feature engineering on time series data.
3. Numpy is most suitable for basic numerical computations such as mean, median, range, etc. It also supports the creation of multi-dimensional arrays.
4. Numpy can also be used to integrate C/C++ and Fortran code.
5. Remember, Python uses zero-based indexing, unlike R, where indexing starts at one.
6. The best part of learning pandas and numpy is the strong, active community support you'll get from around the world.
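
To make points 3 and 5 concrete, here is a minimal sketch. The array values are made up for illustration; only the indexing behavior and the summary functions matter.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([4, 8, 15, 16, 23, 42])

# Zero-based indexing: the first element lives at index 0.
first = values[0]
last = values[-1]

# Basic numerical summaries mentioned in point 3.
mean = values.mean()
median = np.median(values)
value_range = values.max() - values.min()

print(first, last, mean, median, value_range)
```

In R, the first element would be `values[1]`; in Python, `values[1]` is the second element.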

# Starting with Numpy


In [2]:
import numpy as np
np.__version__


'1.13.3'

In [3]:
#Create a list of the numbers from 0 to 9
L=list(range(10))

In [4]:
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
#Converting integers to strings - this style of handling
#lists is known as list comprehension.
#List comprehensions offer a versatile
#way to handle list manipulation tasks easily.
#We'll learn about them in future tutorials. Here's an example.

[str(c) for c in L]

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [6]:
[type(item) for item in L]

[int, int, int, int, int, int, int, int, int, int]

### Creating Arrays
Numpy arrays are homogeneous, i.e., every element shares one data type (integer, float, etc.), unlike lists, which can mix types.
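
A quick sketch of what homogeneity means in practice, assuming a small mixed list invented for this example: the list keeps each element's own type, while the array converts everything to a single dtype.

```python
import numpy as np

# A list keeps each element's own type.
mixed_list = [1, 2.5, 3]
list_types = [type(v).__name__ for v in mixed_list]

# NumPy upcasts all elements to one common dtype (float here).
arr = np.array(mixed_list)

print(list_types)
print(arr.dtype)
```

This is why the `dtype` argument appears in the cells that follow: it lets us choose that single type explicitly.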

In [7]:
np.zeros(10,dtype="int")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [8]:
np.ones((3,5),dtype="float")

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [9]:
#Creating arrays of predefined values 
np.full((3,5),1.23)

array([[ 1.23,  1.23,  1.23,  1.23,  1.23],
       [ 1.23,  1.23,  1.23,  1.23,  1.23],
       [ 1.23,  1.23,  1.23,  1.23,  1.23]])

In [10]:
#Create an array with a set sequence
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [11]:
#Create an array of evenly spaced values over the given range
np.linspace(0,1,5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [12]:
#Create a 3 by 3 array of random values drawn from a normal distribution with mean 0 and standard deviation 1
np.random.normal(0,1,(3,3))

array([[-0.62596669,  1.1661209 , -2.2010724 ],
       [-0.30759059, -0.56651572, -0.54376006],
       [ 0.91884133,  0.47201188, -1.20998295]])

In [13]:
# Create an identity matrix
np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [14]:
#Set a random seed
np.random.seed(0)

x1=np.random.randint(10,size=6)
x2=np.random.randint(10,size=(3,4))
x3=np.random.randint(10,size=(3,4,5))

In [15]:
x1

array([5, 0, 3, 3, 7, 9])

In [16]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [17]:
print("x3 ndim", x3.ndim)
print("x3 shape:",x3.shape)
print("x3 size:",x3.size)

x3 ndim 3
x3 shape: (3, 4, 5)
x3 size: 60


### Array Indexing


In [18]:
#Value in the 3rd row and 4th column
x2[2,3]

7

In [19]:
x2[1,0]

7

### Array Slicing
Slicing offers multiple ways of accessing a range of values from an array.

In [20]:
x=np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [21]:
x[:5]

array([0, 1, 2, 3, 4])

In [22]:
x[4:]

array([4, 5, 6, 7, 8, 9])

In [23]:
x[4:7]

array([4, 5, 6])

In [24]:
#Return elements at even indices (every second element, starting at index 0)
x[::2]

array([0, 2, 4, 6, 8])

In [25]:
#Return every second element, starting at index 1
x[1::2]

array([1, 3, 5, 7, 9])

In [26]:
#Reverse the array
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

### Array Concatenation
We often need to combine different arrays. Instead of typing each of their elements manually, we can use array concatenation to handle such tasks easily.

In [27]:
x=np.array([1,2,3])
y=np.array([3,2,1])
z=[21,21,21]
np.concatenate([x,y,z])

array([ 1,  2,  3,  3,  2,  1, 21, 21, 21])

In [28]:
grid=np.array([[1,2,3],[4,5,6]])
np.concatenate([grid,grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [29]:
#The axis parameter controls whether arrays are joined
#row-wise (axis=0, the default) or column-wise (axis=1)
np.concatenate([grid,grid],axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

Until now, we concatenated arrays of equal dimension. But what if you need to combine a 2D array with a 1D array? In such situations, np.concatenate might not be the best option. Instead, you can use np.vstack or np.hstack to do the task. Let's see how!
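
The following sketch shows why np.concatenate struggles here. The variable names `row`, `concat_failed`, `stacked`, and `same` are introduced only for this example.

```python
import numpy as np

row = np.array([3, 4, 5])                 # shape (3,)
grid = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# np.concatenate requires all inputs to have the same number of dimensions.
try:
    np.concatenate([row, grid])
    concat_failed = False
except ValueError:
    concat_failed = True

# np.vstack treats the 1D array as a single row of shape (1, 3).
stacked = np.vstack([row, grid])

# The explicit equivalent: add the missing axis, then concatenate.
same = np.concatenate([row[np.newaxis, :], grid])

print(concat_failed, stacked.shape)
```

So np.vstack is mostly a convenience: it reshapes the 1D input for us before joining.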

In [30]:
x=np.array([3,4,5])
#grid is the 2x3 array defined in cell 28
np.vstack([x,grid])

array([[3, 4, 5],
       [1, 2, 3],
       [4, 5, 6]])

In [31]:
z=np.array([[9],[9]])
np.hstack([grid,z])

array([[1, 2, 3, 9],
       [4, 5, 6, 9]])

We can also split arrays at predefined positions.

In [32]:
x=np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [33]:
x1,x2,x3=np.split(x,[3,6])
print(x1,x2,x3)

[0 1 2] [3 4 5] [6 7 8 9]


In [34]:
t1,t2,t3,t4=np.split(x,[2,5,7])
print(t1,t2,t3,t4)

[0 1] [2 3 4] [5 6] [7 8 9]


In [35]:
np.hstack([t1,t2,t3,t4])

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [36]:
grid=np.arange(16).reshape((4,4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [37]:
upper,lower=np.vsplit(grid,[2])
print(upper,lower)

[[0 1 2 3]
 [4 5 6 7]] [[ 8  9 10 11]
 [12 13 14 15]]


In addition to the functions we learned above, the numpy library offers several other mathematical functions such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, max, etc., which you can use to perform basic arithmetic calculations. Feel free to refer to the numpy documentation for more information on such functions.
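
As a small illustration, here are a few of these functions applied to one array. The input values are arbitrary and chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([1, -2, 3, -4])

total = np.sum(a)                 # 1 - 2 + 3 - 4
absolute = np.abs(a)              # element-wise absolute value
squared = np.power(a, 2)          # element-wise square
remainder = np.mod(absolute, 2)   # remainder after dividing by 2
doubled = np.multiply(a, 2)       # element-wise multiplication
variance = np.var(absolute)       # population variance of [1, 2, 3, 4]

print(total, absolute, squared, remainder, doubled, variance)
```

All of these operate element-wise or reduce over the whole array, so no explicit loop is needed.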

# Starting with Pandas

In [38]:
import pandas as pd
#create a data frame - dictionary is used here where keys get
#converted to column names and values to row values.
data=pd.DataFrame({'Country': 
                   ['Russia','Colombia','Chile','Equador','Nigeria'],
                    'Rank':[121,40,100,130,11]})
data

,Country,Rank
0,Russia,121
1,Colombia,40
2,Chile,100
3,Equador,130
4,Nigeria,11


In [39]:
#Quick analysis of the dataset
data.describe()

,Rank
count,5.0
mean,80.4
std,52.300096
min,11.0
25%,40.0
50%,100.0
75%,121.0
max,130.0


Remember, the describe() method computes summary statistics for numeric (integer / float) columns only by default. To get complete information about the data set, including column types and non-null counts, we can use the info() function.
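
If we also want a summary of the non-numeric columns, describe() accepts an `include` argument. A minimal sketch on a shortened version of the table above:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Russia', 'Colombia', 'Chile'],
                   'Rank': [121, 40, 100]})

# Default: only numeric columns are summarized.
numeric_summary = df.describe()

# include='all' also summarizes object columns (count, unique, top, freq).
full_summary = df.describe(include='all')

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

Statistics that do not apply to a column (for example, the mean of Country) show up as NaN in the combined summary.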

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Country    5 non-null object
Rank       5 non-null int64
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


In [41]:
data=pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                   'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

,group,ounces
0,a,4.0
1,a,3.0
2,a,12.0
3,b,6.0
4,b,7.5
5,b,8.0
6,c,3.0
7,c,5.0
8,c,6.0


In [42]:
#Sort the data frame by ounces - inplace=True would modify data itself; inplace=False returns a sorted copy
data.sort_values(by=['ounces'],ascending=True,inplace=False)

,group,ounces
1,a,3.0
6,c,3.0
0,a,4.0
7,c,5.0
3,b,6.0
8,c,6.0
4,b,7.5
5,b,8.0
2,a,12.0


In [43]:
#We can sort the data by multiple columns, not just one
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)

,group,ounces
2,a,12.0
0,a,4.0
1,a,3.0
5,b,8.0
4,b,7.5
3,b,6.0
8,c,6.0
7,c,5.0
6,c,3.0


Often, we get data sets with duplicate rows, which are nothing but noise. Therefore, before training a model, we need to get rid of such inconsistencies in the data set. Let's see how we can remove duplicate rows.

In [44]:
#Create another data frame with duplicated rows
data=pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[3,2,1,3,3,4,4]})
data

,k1,k2
0,one,3
1,one,2
2,one,1
3,two,3
4,two,3
5,two,4
6,two,4


In [45]:
#Sort values
data.sort_values(by="k2")

,k1,k2
2,one,1
1,one,2
0,one,3
3,two,3
4,two,3
5,two,4
6,two,4


In [46]:
#Remove duplicates
data.drop_duplicates()

,k1,k2
0,one,3
1,one,2
2,one,1
3,two,3
5,two,4


Here, we removed duplicates based on matching row values across all columns. Alternatively, we can also remove duplicates based on a particular column. Let's remove duplicate values from the k1 column.

In [47]:
data.drop_duplicates(subset='k1')

,k1,k2
0,one,3
3,two,3
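
By default, drop_duplicates keeps the first row of each duplicate group. The sketch below, built on the same k1/k2 data, shows the `keep` argument and the related duplicated() method; the variable names are chosen just for this example.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})

# keep='first' (the default) retains the first row per k1 value.
first_kept = data.drop_duplicates(subset='k1')

# keep='last' retains the last row per k1 value instead.
last_kept = data.drop_duplicates(subset='k1', keep='last')

# duplicated() flags rows that repeat an earlier row across all columns.
dup_mask = data.duplicated()

print(first_kept.index.tolist(), last_kept.index.tolist(), dup_mask.sum())
```

Checking `duplicated().sum()` first is a cheap way to see how many rows would be dropped before actually dropping them.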


In [48]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Now, we want to create a new variable that indicates the animal each food comes from. To do that, we'll first create a dictionary that maps each food to its animal. Then, we'll use the map function to look up each food value (the dictionary's keys) and return the matching animal (the dictionary's values). Let's see how it is done.

In [49]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}


In [50]:
def meat_2_animal(series):
    if series['food']=='bacon':
        return 'pig'
    elif series['food']=='pulled pork':
        return 'pig'
    elif series['food']=='pastrami':
        return 'cow'
    elif series['food']=='corned beef':
        return 'cow'
    elif series['food']=='honey ham':
        return 'pig'
    else:
        return 'salmon'
#Create a new variable
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
data

,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
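
One detail worth knowing about map with a dictionary: values that have no matching key become NaN rather than raising an error. A small sketch, where 'tofu' is a made-up entry added only to show this behavior:

```python
import pandas as pd

# 'tofu' is hypothetical and deliberately missing from the dictionary.
foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Lowercase first so 'Pastrami' matches the 'pastrami' key.
animals = foods.str.lower().map(meat_to_animal)

print(animals.tolist())
```

This is why we lowercase the food names before mapping: without it, 'Pastrami' and 'Bacon' would silently turn into NaN.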


In [51]:
#another way of doing this is to convert the food values to the lower case and 
#apply the function
lower=lambda x:x.lower()
data['food']=data['food'].apply(lower)
data['animal2']=data.apply(meat_2_animal,axis="columns")
data

,food,ounces,animal,animal2
0,bacon,4.0,pig,pig
1,pulled pork,3.0,pig,pig
2,bacon,12.0,pig,pig
3,pastrami,6.0,cow,cow
4,corned beef,7.5,cow,cow
5,bacon,8.0,pig,pig
6,pastrami,3.0,cow,cow
7,honey ham,5.0,pig,pig
8,nova lox,6.0,salmon,salmon


In [52]:
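#Could list comprehension be applied here? Yes, but each item of
#data['food'] is a plain string, so calling meat_2_animal(row) fails with
#"TypeError: string indices must be integers" (the function expects a row).
#Look the string up in the dictionary instead.
data['food']=[food.lower() for food in data['food']]
data['animal2']=[meat_to_animal[food] for food in data['food']]
data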

In [53]:
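#meat_2_animal expects a single row. Passing the whole food column makes
#series['food'] a label lookup on the column's index, which has no such label.
meat_2_animal(data['food'])

KeyError: 'food'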

# Side Note: List Comprehension with a Pandas DataFrame


In [54]:
#Creating an example dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df1=pd.DataFrame(data,index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df1

,name,reports,year
Cochice,Jason,4,2012
Pima,Molly,24,2012
Santa Cruz,Tina,31,2013
Maricopa,Jake,2,2014
Yuma,Amy,3,2014


In [55]:
#First, create a new column called next_year with an explicit loop
#to get the idea across
next_year=[]
for row in df1['year']:
    next_year.append(row+1)
df1['next_year']=next_year
df1

,name,reports,year,next_year
Cochice,Jason,4,2012,2013
Pima,Molly,24,2012,2013
Santa Cruz,Tina,31,2013,2014
Maricopa,Jake,2,2014,2015
Yuma,Amy,3,2014,2015


In [56]:
#Using list comprehension to do the same thing
df1['previous year']=[row-1 for row in df1['year']]
df1

,name,reports,year,next_year,previous year
Cochice,Jason,4,2012,2013,2011
Pima,Molly,24,2012,2013,2011
Santa Cruz,Tina,31,2013,2014,2012
Maricopa,Jake,2,2014,2015,2013
Yuma,Amy,3,2014,2015,2013


In [57]:
df1['random col']=[row+row*2 for row in df1['year']]
df1

,name,reports,year,next_year,previous year,random col
Cochice,Jason,4,2012,2013,2011,6036
Pima,Molly,24,2012,2013,2011,6036
Santa Cruz,Tina,31,2013,2014,2012,6039
Maricopa,Jake,2,2014,2015,2013,6042
Yuma,Amy,3,2014,2015,2013,6042


# End of Side Note

Another way to create a new variable is the assign function. As you keep discovering new functions in this tutorial, you'll realize how powerful pandas is.
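
A minimal, self-contained sketch of assign on a two-row version of the food data (the shortened frame is an assumption made to keep the example small). Note that assign returns a new data frame and leaves the original untouched.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pastrami'],
                     'ounces': [4.0, 6.0]})

# assign returns a copy with the extra column; data itself is unchanged.
with_new = data.assign(new_variable=data['ounces'] * 10)

# A callable receives the data frame, which is handy in method chains.
with_double = data.assign(double_ounces=lambda d: d['ounces'] * 2)

print(with_new.columns.tolist())
print(data.columns.tolist())
```

To keep the new column, assign the result back, for example `data = data.assign(...)`.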

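In [58]:
#Note: data must be the food data frame here. The side note above reassigned
#data to a plain dictionary, which makes this call fail with
#"AttributeError: 'dict' object has no attribute 'assign'".
#Rerun cells 48 to 51 to rebuild the data frame before running this cell.
data.assign(new_variable=data['ounces']*10)

In [60]:
#Remove the animal2 column from the data frame (same prerequisite as above)
data.drop("animal2",axis='columns',inplace=True)
data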

We frequently find missing values in our data sets. A quick way to impute them is to fill each missing value with a placeholder such as a constant. Beyond missing values, you may also find outliers or sentinel codes in your data set that need replacing. Let's see how we can replace values.

In [61]:
#pd.Series creates a one-dimensional labeled array
data=pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [62]:
#Replace -999 with NaN values using the replace function
data.replace(-999,np.nan,inplace=True)
data

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [63]:
#Replacing multiple values at a time
data=pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace([-999,-1000],np.nan,inplace=True)
data

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
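
Once the sentinel values are NaN, they can be imputed. The sketch below fills them with the median of the remaining values; choosing the median is an assumption made for illustration, not a recommendation for every data set.

```python
import numpy as np
import pandas as pd

s = pd.Series([1., -999., 2., -999., -1000., 3.])

# Step 1: turn the sentinel codes into proper missing values.
cleaned = s.replace([-999, -1000], np.nan)

# Step 2: fill the missing values with the median of the observed ones.
filled = cleaned.fillna(cleaned.median())

print(cleaned.isna().sum())
print(filled.tolist())
```

Doing the replacement first matters: computing the median on the raw series would let -999 and -1000 distort it.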

# Renaming Columns and Index (Row Names)

In [64]:
data=pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=['Ohio', 'Colorado', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

In [65]:
data

,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [66]:
#Using rename function
data.rename(index={'Ohio':"SanF"},columns=
           {'one':'one_p','two':'two_p'},inplace=True)
data

,one_p,two_p,three,four
SanF,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [67]:
#Using String functions to perform this operation
data.rename(index=str.upper,columns=str.title,inplace=True)
data

,One_P,Two_P,Three,Four
SANF,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


In [68]:
data.rename(index=str.lower,columns=str.upper,inplace=True)
data

,ONE_P,TWO_P,THREE,FOUR
sanf,0,1,2,3
colorado,4,5,6,7
new york,8,9,10,11


# Using Pandas to Categorize (Bin) Continuous Variables

In [69]:
ages=[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [70]:
#Understand the output - '(' means the edge value is excluded from the bin,
#']' means the edge value is included
bins=[18, 25, 35, 60, 100]
cats=pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [71]:
#right=False closes each bin on the left and opens it on the right: [18, 25) includes 18 but excludes 25
pd.cut(ages,bins,right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
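
To see how the bracket notation affects values that sit exactly on a bin edge, here is a small sketch. The three ages are chosen on purpose to fall on the edges; a code of -1 means the value landed in no bin.

```python
import pandas as pd

# Ages placed exactly on bin edges, for illustration only.
ages = [18, 25, 35]
bins = [18, 25, 35, 60]

# Default (right=True): bins look like (18, 25], so 18 is outside every bin.
right_closed = pd.cut(ages, bins)

# right=False: bins look like [18, 25), so 18 is inside and 25 moves up a bin.
left_closed = pd.cut(ages, bins, right=False)

print(right_closed.codes.tolist())
print(left_closed.codes.tolist())
```

With the default setting, a value equal to the lowest edge gets no bin at all, which is easy to overlook when the minimum of the data matches the first edge.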

In [73]:
#pandas internally assigns an integer code to each category; read the codes via
#.codes (the older .labels attribute is deprecated in favor of .codes)
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [74]:
pd.value_counts(cats) #Number of values per category

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [75]:
bin_names=["Youth","YoungAdult","MiddleAge","Senior"]
new_cats=pd.cut(ages,bins,labels=bin_names)

pd.value_counts(new_cats)

Youth         5
MiddleAge     3
YoungAdult    3
Senior        1
dtype: int64

In [76]:
pd.value_counts(new_cats).cumsum()

Youth          5
MiddleAge      8
YoungAdult    11
Senior        12
dtype: int64

Let's proceed and learn about grouping data and creating pivots in pandas. Grouping is an immensely important data analysis method that you'll probably use on every data set you work with.

In [77]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

,data1,data2,key1,key2
0,1.254414,1.149076,a,one
1,1.419102,-1.193578,a,two
2,-0.743856,1.141042,b,one
3,-2.517437,1.509445,b,two
4,-1.507096,1.067775,a,one


In [78]:
#Calculate the mean of data1 column by key1
grouped=df['data1'].groupby(df['key1'])
grouped.mean()

key1
a    0.388807
b   -1.630647
Name: data1, dtype: float64

In [79]:
grouped1=df["data1"].groupby(df['key2'])
grouped1.mean()

key2
one   -0.332179
two   -0.549168
Name: data1, dtype: float64
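
The same idea extends to several keys and several statistics at once. In this sketch the data1 values are fixed numbers (an assumption, replacing the random values above) so the results are reproducible.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Passing the column name is equivalent to passing df['key1'].
by_key1 = df.groupby('key1')['data1'].mean()

# Group by two keys and compute several statistics in one call.
by_both = df.groupby(['key1', 'key2'])['data1'].agg(['mean', 'count'])

print(by_key1)
print(by_both)
```

The two-key result has a hierarchical index, so a single group is addressed with a tuple such as `by_both.loc[('a', 'one')]`.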

In [80]:
#Create a date-indexed data frame to demonstrate slicing
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.random.randn(6,4), index=dates,columns=list('ABCD'))
df

,A,B,C,D
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404
2013-01-03,0.724006,0.359003,1.076121,0.192141
2013-01-04,0.852926,0.018357,0.428304,0.996278
2013-01-05,-0.49115,0.712678,1.11334,-2.153675
2013-01-06,-0.416111,-1.070897,0.221139,-1.123057


In [81]:
df['Category']=['M','M','D','D','Y','Y']

In [82]:
df

,A,B,C,D,Category
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M
2013-01-03,0.724006,0.359003,1.076121,0.192141,D
2013-01-04,0.852926,0.018357,0.428304,0.996278,D
2013-01-05,-0.49115,0.712678,1.11334,-2.153675,Y
2013-01-06,-0.416111,-1.070897,0.221139,-1.123057,Y


In [83]:
df.groupby(df["Category"])

<pandas.core.groupby.DataFrameGroupBy object at 0x114c80a20>

In [84]:
df["A"].mean()

0.05850936252773096

In [85]:
df[:3]

,A,B,C,D,Category
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M
2013-01-03,0.724006,0.359003,1.076121,0.192141,D


In [86]:
#Slicing based on date range
df["20130101":"20130104"]

,A,B,C,D,Category
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M
2013-01-03,0.724006,0.359003,1.076121,0.192141,D
2013-01-04,0.852926,0.018357,0.428304,0.996278,D


In [87]:
#Slicing based on column names
df.loc[:,["A","B"]]

,A,B
2013-01-01,-0.686589,0.014873
2013-01-02,0.367974,-0.044724
2013-01-03,0.724006,0.359003
2013-01-04,0.852926,0.018357
2013-01-05,-0.49115,0.712678
2013-01-06,-0.416111,-1.070897


In [88]:
#Slicing based on both row index labels and column names
df.loc['20130102':'20130103',['A','B']]

,A,B
2013-01-02,0.367974,-0.044724
2013-01-03,0.724006,0.359003


In [89]:
#Selecting a single row by its integer position (not by its label)
df.iloc[3]

A            0.852926
B           0.0183572
C            0.428304
D            0.996278
Category            D
Name: 2013-01-04 00:00:00, dtype: object

In [90]:
#Return a specific range of rows and columns by integer position, not by label
df.iloc[2:4,0:2]

,A,B
2013-01-03,0.724006,0.359003
2013-01-04,0.852926,0.018357


In [91]:
#Returns specific rows and columns using lists containing columns or row indexes
df.iloc[[1,5],[0,2]]

,A,C
2013-01-02,0.367974,-0.302375
2013-01-06,-0.416111,0.221139
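
One difference between loc and iloc is easy to miss: label slices include the end label, while position slices exclude the end position. A sketch on a frame filled with consecutive integers (an assumption, replacing the random values used above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based: both 2013-01-02 and 2013-01-03 are returned.
by_label = df.loc['20130102':'20130103', ['A', 'B']]

# Position-based: rows 1 and 2 are returned, row 3 is excluded.
by_position = df.iloc[1:3, 0:2]

print(by_label.equals(by_position))
```

Both selections return the same two rows, even though the slice ends look different.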


In [92]:
#Boolean indexing: filter rows based on column values.
#This helps in filtering a data set based on a pre-defined condition.
df[df["B"]>.5]

,A,B,C,D,Category
2013-01-05,-0.49115,0.712678,1.11334,-2.153675,Y


In [93]:
#Copy the data set
df2=df.copy()
df2["E"]=['one', 'one','two','three','four','three']
df2

,A,B,C,D,Category,E
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M,one
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M,one
2013-01-03,0.724006,0.359003,1.076121,0.192141,D,two
2013-01-04,0.852926,0.018357,0.428304,0.996278,D,three
2013-01-05,-0.49115,0.712678,1.11334,-2.153675,Y,four
2013-01-06,-0.416111,-1.070897,0.221139,-1.123057,Y,three


In [94]:
#Select rows based on column values 
df2[df2['E'].isin(['two','four'])]

,A,B,C,D,Category,E
2013-01-03,0.724006,0.359003,1.076121,0.192141,D,two
2013-01-05,-0.49115,0.712678,1.11334,-2.153675,Y,four


In [95]:
#Select all rows except those with two and four
df2[~df2['E'].isin(['two','four'])]

,A,B,C,D,Category,E
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M,one
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M,one
2013-01-04,0.852926,0.018357,0.428304,0.996278,D,three
2013-01-06,-0.416111,-1.070897,0.221139,-1.123057,Y,three


In [96]:
#Using the query method to select rows based on a criterion
df.query("A>C") #Rows where the value in column A is larger than the value in column C

,A,B,C,D,Category
2013-01-02,0.367974,-0.044724,-0.302375,-2.224404,M
2013-01-04,0.852926,0.018357,0.428304,0.996278,D


In [97]:
df.query("A<B|C>A") #Rows where A is less than B, or C is greater than A

,A,B,C,D,Category
2013-01-01,-0.686589,0.014873,-0.375666,-0.038224,M
2013-01-03,0.724006,0.359003,1.076121,0.192141,D
2013-01-05,-0.49115,0.712678,1.11334,-2.153675,Y
2013-01-06,-0.416111,-1.070897,0.221139,-1.123057,Y
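
A query string is shorthand for a Boolean mask. The sketch below writes the same filter both ways on a tiny made-up frame, so the equivalence is easy to verify.

```python
import pandas as pd

# Small hypothetical frame; only the last row fails both conditions.
df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [2, 1, 3],
                   'C': [0, 6, 2]})

via_query = df.query("A < B | C > A")

# With masks, each comparison needs its own parentheses.
via_mask = df[(df['A'] < df['B']) | (df['C'] > df['A'])]

print(via_query.index.tolist())
```

The query form is usually easier to read; the mask form is needed when column names are not valid Python identifiers or when the condition is built programmatically.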


Pivot tables are extremely useful in analyzing data using a customized tabular format. I think, among other things, Excel is popular because of the pivot table option. It offers a super-quick way to analyze data.

In [98]:
#Create a data frame
data=pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

,group,ounces
0,a,4.0
1,a,3.0
2,a,12.0
3,b,6.0
4,b,7.5
5,b,8.0
6,c,3.0
7,c,5.0
8,c,6.0


In [99]:
#Calculate the mean of each group
data.pivot_table(values='ounces',index='group',aggfunc='mean')

group,ounces
a,6.333333
b,7.166667
c,4.666667


In [100]:
#Calculate count by each group
data.pivot_table(values='ounces',index='group',aggfunc='count')

group,ounces
a,3
b,3
c,3


In [101]:
#Calculate sum by each group
data.pivot_table(values='ounces',index='group',aggfunc='sum')

group,ounces
a,19.0
b,21.5
c,14.0
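
For a single index column, a pivot table gives the same numbers as a groupby. The sketch below checks that on the same data and also passes a list of aggregation functions to get several statistics in one table; the variable names are chosen for this example only.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

pivot = data.pivot_table(values='ounces', index='group', aggfunc='mean')
grouped = data.groupby('group')['ounces'].mean()

# A list of functions produces one column block per function.
summary = data.pivot_table(values='ounces', index='group', aggfunc=['mean', 'sum'])

print(pivot)
print(summary)
```

With a list of functions, the columns become two-level, so a single column is selected with a tuple such as `summary[('sum', 'ounces')]`.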


# Exploring an ML Data Set
The dependent variable is "target". This is a binary classification problem: we need to predict whether a given person's salary is at most 50K or more than 50K.

In [103]:
#load the data
train=pd.read_csv("/Users/shengyuchen/Dropbox/Engagement - Business/My Hub/AI:ML:DL Playground/Basics/Python Data Manipulations/Adult/train.csv")
test=pd.read_csv("/Users/shengyuchen/Dropbox/Engagement - Business/My Hub/AI:ML:DL Playground/Basics/Python Data Manipulations/Adult/test.csv")




In [104]:
train.shape

(32561, 15)

In [105]:
test.shape

(16281, 15)

In [106]:
train.head(5)

,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [107]:
train.describe()

,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [108]:
train["target"].describe()

count      32561
unique         2
top        <=50K
freq       24720
Name: target, dtype: object
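
The summary above shows that one class dominates. A sketch of how we might inspect the class shares and turn the target into a 0/1 flag follows. The four-row `sample` frame is invented for illustration, and the label '>50K' is an assumption about how the second class is spelled in the file (real labels may also carry extra whitespace).

```python
import pandas as pd

# Hypothetical stand-in for train; '>50K' is an assumed label spelling.
sample = pd.DataFrame({'age': [39, 50, 38, 52],
                       'target': ['<=50K', '<=50K', '<=50K', '>50K']})

# Share of each class instead of raw counts.
class_share = sample['target'].value_counts(normalize=True)

# Encode the target as 1 for the higher-income class, 0 otherwise.
sample['target_flag'] = (sample['target'] == '>50K').astype(int)

print(class_share)
print(sample['target_flag'].tolist())
```

On the real data, it is worth running `train['target'].unique()` first to confirm the exact label strings before encoding.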

In [109]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         30725 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education.num     32561 non-null int64
marital.status    32561 non-null object
occupation        30718 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital.gain      32561 non-null int64
capital.loss      32561 non-null int64
hours.per.week    32561 non-null int64
native.country    31978 non-null object
target            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [110]:
train["education"].describe()

count        32561
unique          16
top        HS-grad
freq         10501
Name: education, dtype: object

In [149]:
#Import only the bokeh names we need instead of using a wildcard import
from bokeh.plotting import figure, output_notebook, show
output_notebook()


#Set up an empty figure; no data is plotted yet
p=figure(tools="pan,box_zoom,reset,save,lasso_select",
         x_axis_label='education years',y_axis_label='age')

In [159]:
import matplotlib
%matplotlib inline
#train.iloc[:,0].plot()

In [151]:
train["age"].describe()

count    32561.000000
mean        38.581647
std         13.640433
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

In [154]:
train.iloc[:10,0]

0    39
1    50
2    38
3    53
4    28
5    37
6    49
7    52
8    31
9    42
Name: age, dtype: int64

In [160]:
nans=train.shape[0]-train.dropna().shape[0]
print("%d rows have missing values in the training data" %nans)

2399 rows have missing values in the training data
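
The count above tells us how many rows are affected, but not which columns are responsible. A sketch of a per-column breakdown, using a tiny made-up frame in place of train (the column names are borrowed from the data set, the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train with a few missing entries.
sample = pd.DataFrame({'workclass': ['Private', np.nan, 'State-gov'],
                       'occupation': [np.nan, np.nan, 'Sales'],
                       'age': [39, 50, 38]})

# Same row-level count as in the cell above.
rows_with_nan = sample.shape[0] - sample.dropna().shape[0]

# Number of missing values in each column.
per_column = sample.isnull().sum()

# Equivalent row-level count without building a copy of the data.
same_count = sample.isnull().any(axis=1).sum()

print(rows_with_nan, same_count)
print(per_column)
```

Applied to train, `train.isnull().sum()` should point at workclass, occupation, and native.country, the three columns whose non-null counts in the info() output fall below 32561.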
