In this kernel, we'll learn to use the numpy and pandas libraries for data manipulation from scratch. Instead of going into theory, we'll take a practical approach.

First, we'll understand the syntax and commonly used functions of the numpy and pandas libraries. Later, we'll work on a real-life data set.

## Table of Contents

* Important things about Numpy and Pandas
* Starting with Numpy
* Starting with Pandas
* Exploring an ML Data Set
* Building a Random Forest Model


## Important things about Numpy and Pandas

1. The data manipulation capabilities of pandas are built on top of the numpy library. In a way, numpy is a dependency of the pandas library.

2. Pandas is best at handling tabular data sets comprising different variable types (integer, float, double, etc.). In addition, the pandas library can be used for everything from simple tasks such as loading data to more involved ones such as feature engineering on time series data.

3. Numpy is most suitable for performing basic numerical computations such as mean, median, range, etc. It also supports the creation of multi-dimensional arrays.

Just to give you a flavor of the numpy library, we'll quickly go through its syntax structures and some important commands such as slicing, indexing, concatenation, etc. All these commands will come in handy when using pandas as well. Let's get started!

## Starting with Numpy

In [None]:
#load the library and check its version, just to make sure we aren't using an older version
import numpy as np
np.__version__

In [None]:
#create a list comprising numbers from 0 to 20
L = list(range(21))
L

In [None]:
#convert integers to strings - this style of handling lists is known as list comprehension.
[str(c) for c in L]

In [None]:
#List comprehension offers a versatile way to handle list manipulation tasks easily.
[type(item) for item in L]

### Creating Arrays
Numpy arrays are homogeneous in nature, i.e., they comprise one data type (integer, float, double, etc.) unlike lists.
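
To see what homogeneity means in practice, here is a minimal sketch (the variable names are illustrative, not part of the original kernel). It builds arrays from mixed inputs and checks the resulting dtype: numpy converts every element to one common type.

```python
import numpy as np

# a Python list can hold mixed types as-is
mixed_list = [1, 2.5, 3]

# numpy upcasts the integers to float so every element shares one dtype
float_arr = np.array(mixed_list)
print(float_arr.dtype)      # float64

# mixing numbers and strings yields an array of strings
str_arr = np.array([1, 'a'])
print(str_arr.dtype.kind)   # U (unicode string)
```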

In [None]:
#creating arrays
np.zeros(10, dtype='int')

In [None]:
#creating a 3 row x 8 column matrix
np.ones((3,8), dtype=float)

In [None]:
#creating a matrix with a predefined value
np.full((3,5),1.23)

In [None]:
#create an array with a set sequence
np.arange(0, 20, 2)

In [None]:
#create an array of evenly spaced values within the given range
np.linspace(0, 1, 5)

In [None]:
#create a 3x3 array of values drawn from a normal distribution with mean 0 and standard deviation 1
np.random.normal(0, 1, (3,3))

In [None]:
#create an identity matrix
np.eye(5)

In [None]:
#set a random seed
np.random.seed(0)


x1 = np.random.randint(10, size=4) #one dimension
x2 = np.random.randint(10, size=(3,5)) #two dimension
x3 = np.random.randint(10, size=(3,5,5)) #three dimension


print("x1 ndim:", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)

In [None]:
print("x2 ndim:", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)

In [None]:
print("x3 ndim:", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

### Array Indexing
The important thing to remember is that indexing in Python starts at zero.

In [None]:
x1 = np.array([4, 3, 4, 4, 8, 4])
x1

In [None]:
#access the value at index zero
x1[0]

In [None]:
#access the fifth value
x1[4]

In [None]:
#get the last value
x1[-1]

In [None]:
#get the second last value
x1[-2]

In [None]:

#in a multidimensional array, we need to specify row and column index
x2=np.array([[3, 7, 5, 5],[0, 1, 5, 9],[3, 0, 5, 0]])
x2

In [None]:
#3rd row, last column
x2[2,-1]

In [None]:
#replace value at 0,0 index
x2[0,0] = 12
x2

### Array Slicing
Now, we'll learn to access multiple or a range of elements from an array.

In [None]:
x = np.arange(20)
x

In [None]:
#from the start up to (but not including) index 5, i.e., the first five elements
x[:5]

In [None]:
#from index 4 to the end
x[4:]

In [None]:
#from index 4 through index 6
x[4:7]

In [None]:
#return elements at even indices
x[ : : 2]

In [None]:
#return every second element, starting from index 1
x[1::2]

In [None]:
#reverse the array
x[::-1]
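
One detail worth knowing about the slices above: a basic numpy slice is a view on the original array, not a copy. The sketch below (with illustrative variable names) shows that writing to a slice changes the source array, and that `.copy()` avoids this.

```python
import numpy as np

x = np.arange(10)

# a slice is a view: it shares memory with the original array
view = x[2:5]
view[0] = 99
print(x[2])    # 99, the original array changed too

# call .copy() when an independent array is needed
y = np.arange(10)
independent = y[2:5].copy()
independent[0] = 99
print(y[2])    # 2, the original array is untouched
```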

### Array Concatenation

We often need to combine different arrays. Instead of typing each of their elements manually, we can use array concatenation to handle such tasks easily.

In [None]:
#You can concatenate two or more arrays at once.
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21,21,21]
np.concatenate([x, y,z])

In [None]:
#You can also use this function to create 2-dimensional arrays.
grid = np.array([[1,2,3],[4,5,6]])
np.concatenate([grid,grid])

In [None]:
#Using the axis parameter, you can concatenate row-wise (axis=0, the default) or column-wise (axis=1)
np.concatenate([grid,grid],axis=1)

What if you are required to combine a 2D array with a 1D array? In such situations, we can use `np.vstack` or `np.hstack` to do the task. Let's see how!

In [None]:
x = np.array([3,4,5])
grid = np.array([[1,2,3],[17,18,19]])
np.vstack([x,grid])

In [None]:
#Similarly, you can add an array using np.hstack
z = np.array([[9],[9]])
np.hstack([grid,z])
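
As a rough illustration of why `np.vstack` is convenient here, the sketch below checks the shapes involved. It reuses the values from the cells above under illustrative names, and shows that `np.concatenate` gives the same result once the 1D array is reshaped to two dimensions.

```python
import numpy as np

row = np.array([3, 4, 5])                    # shape (3,)
grid = np.array([[1, 2, 3], [17, 18, 19]])   # shape (2, 3)

# vstack treats the 1D array as a single row
stacked = np.vstack([row, grid])
print(stacked.shape)    # (3, 3)

# np.concatenate needs matching dimensions, so reshape the 1D array to (1, 3) first
same = np.concatenate([row.reshape(1, 3), grid], axis=0)
print(np.array_equal(stacked, same))    # True
```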

We can also split arrays based on pre-defined positions. Let's see how!

In [None]:
x = np.arange(10)
x

In [None]:
x1,x2,x3 = np.split(x,[3,6])
print(x1,x2,x3)

In [None]:
grid = np.arange(16).reshape((4,4))
grid

In [None]:
upper,lower = np.vsplit(grid,[2])
print (upper, lower)
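
The same idea works column-wise. As a small sketch (variable names are illustrative), `np.hsplit` splits along columns, and `np.array_split` handles the case where the array does not divide evenly.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# split the columns at position 2: left gets columns 0-1, right gets columns 2-3
left, right = np.hsplit(grid, [2])
print(left.shape, right.shape)    # (4, 2) (4, 2)

# array_split tolerates sizes that do not divide evenly
parts = np.array_split(np.arange(10), 3)
print([len(p) for p in parts])    # [4, 3, 3]
```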

In addition, there are several other mathematical functions available in the numpy library such as sum, divide, multiply, abs, power, mod, sin, cos, tan, log, var, min, mean, max, etc., which can be used to perform basic arithmetic calculations. Feel free to refer to the numpy documentation for more information on such functions.

Let's move on to pandas now. Make sure you follow each line below because it'll help you in doing data manipulation using pandas.
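
A few of these functions in action, as a minimal sketch on a made-up array. Each one operates on the whole array at once, so no explicit loop is needed.

```python
import numpy as np

x = np.array([1, -2, 3, -4])

# element-wise functions return a new array of the same shape
print(np.abs(x).tolist())        # [1, 2, 3, 4]
print(np.power(x, 2).tolist())   # [1, 4, 9, 16]
print(np.mod(x, 3).tolist())     # [1, 1, 0, 2]

# aggregations reduce the array to a single value
print(x.sum(), x.mean(), x.min(), x.max())   # -2 -0.5 -4 3
```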

## Starting with Pandas

In [None]:
#load library - pd is just an alias. 
import pandas as pd

In [None]:
#create a data frame - dictionary is used here where keys get converted to column names and values to row values.
data = pd.DataFrame({'Country': ['India','Nepal','Pakistan','Bangladesh','Bhutan'],
                    'Rank':[11,40,100,130,101]})
data

In [None]:
#We can do a quick analysis of any data set using:
data.describe()

To get the complete information about the data set, we can use the `info()` function.

In [None]:
data.info()

In [None]:
#Let's create another data frame.
data = pd.DataFrame({'group':['x', 'x', 'x', 'y','y', 'y', 'z', 'z','z'],'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
#Let's sort the data frame by ounces - inplace=False returns a sorted copy; inplace=True would modify the data itself
data.sort_values(by=['ounces'],ascending=True,inplace=False)

We can sort the data by not just one column but multiple columns as well.

In [None]:
data.sort_values(by=['group','ounces'],ascending=[True,False],inplace=False)
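
To make the effect of `inplace=False` concrete, here is a small sketch on a made-up data frame. The sorted result is a new object, the original keeps its order, and the sorted rows keep their original index labels.

```python
import pandas as pd

df = pd.DataFrame({'group': ['x', 'y', 'z'], 'ounces': [12, 3, 7.5]})

# sort_values returns a new data frame by default
sorted_df = df.sort_values(by='ounces')

print(df['ounces'].tolist())          # [12.0, 3.0, 7.5], original order is kept
print(sorted_df['ounces'].tolist())   # [3.0, 7.5, 12.0]
print(sorted_df.index.tolist())       # [1, 2, 0], rows keep their original labels
```

If a fresh 0..n-1 index is preferred after sorting, `sorted_df.reset_index(drop=True)` provides one.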

In [None]:
#create another data with duplicated rows
data = pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[3,2,1,3,3,4,4]})
data

In [None]:
#sort values 
data.sort_values(by='k2')

In [None]:
#remove duplicates - ta da! 
data.drop_duplicates()

We can also remove duplicates based on a particular column. Let's remove duplicate values from the k1 column.

In [None]:
data.drop_duplicates(subset='k1')
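
By default, `drop_duplicates` keeps the first occurrence of each duplicate. The sketch below rebuilds the same toy data frame and uses the `keep` argument to retain the last occurrence instead.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [3, 2, 1, 3, 3, 4, 4]})

# fully duplicated rows are dropped, leaving 5 unique rows
print(len(data.drop_duplicates()))    # 5

# keep the last row seen for each k1 value instead of the first
last = data.drop_duplicates(subset='k1', keep='last')
print(last['k2'].tolist())            # [1, 4]
```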

Now, we will learn to categorize rows based on predefined criteria. The need to categorize a variable comes up a lot during data processing.

In [None]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

def meat_2_animal(series):
    if series['food'] == 'bacon':
        return 'pig'
    elif series['food'] == 'pulled pork':
        return 'pig'
    elif series['food'] == 'pastrami':
        return 'cow'
    elif series['food'] == 'corned beef':
        return 'cow'
    elif series['food'] == 'honey ham':
        return 'pig'
    else:
        return 'salmon'


#create a new variable
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data

In [None]:
#another way of doing it is: convert the food values to the lower case and apply the function
lower = lambda x: x.lower()
data['food'] = data['food'].apply(lower)
data['animal2'] = data.apply(meat_2_animal, axis='columns')
data

Another way to create a new variable is by using the assign function.

In [None]:
data.assign(new_variable = data['ounces']*14)

Let's remove the column animal2 from our data frame.

In [None]:
data.drop('animal2',axis='columns',inplace=True)
data

A quick method for handling missing values is to fill them with a placeholder value. Data sets also often contain sentinel values (such as -999) or outliers, which might require replacing.

In [None]:
#The Series function from pandas is used to create one-dimensional labeled arrays
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

In [None]:
#replace -999 with NaN values
data.replace(-999, np.nan,inplace=True)
data

In [None]:
#We can also replace multiple values at once.
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace([-999,-1000],np.nan,inplace=True)
data
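
`replace` also accepts a dictionary when each value needs its own replacement. The sketch below (treating -1000 as zero is only an assumption for illustration) maps the two sentinel values differently and then fills the remaining NaN values with the mean of the series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1., -999., 2., -999., -1000., 3.])

# map each sentinel value to its own replacement
mapped = s.replace({-999: np.nan, -1000: 0})
print(mapped.isna().sum())    # 2

# fill the remaining NaN values with the mean of the observed values
filled = mapped.fillna(mapped.mean())
print(filled.tolist())        # [1.0, 1.5, 2.0, 1.5, 0.0, 3.0]
```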

Now, let's learn how to rename columns and the index (row labels).

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data

In [None]:
#Using rename function
data.rename(index = {'Ohio':'SanF'}, columns={'one':'one_p','two':'two_p'},inplace=True)
data

In [None]:
#You can also use string functions
data.rename(index = str.upper, columns=str.title,inplace=True)
data

Next, let's learn to categorize (bin) continuous variables.

In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

We'll divide the ages into bins such as 18-25, 26-35, 36-60, and 60 and above.

In [None]:
#Understand the output - '(' means the value is excluded from the bin, ']' means the value is included
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

In [None]:
#To include the left bin edge and exclude the right one instead, we can do:
pd.cut(ages,bins,right=False)
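
The effect of the `right` argument is easiest to see on a value that sits exactly on a bin edge. In this small sketch, the age 25 lands in a different bin depending on which side of the interval is closed.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]

# default (right=True): intervals are open on the left and closed on the right
right_closed = pd.cut([25], bins)
print(right_closed[0])    # (18, 25]

# right=False: intervals are closed on the left and open on the right
left_closed = pd.cut([25], bins, right=False)
print(left_closed[0])     # [25, 35)
```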

In [None]:
#Let's check how many observations fall under each bin
pd.Series(cats).value_counts()

We can also pass a unique name for each bin using the `labels` argument.

In [None]:
bin_names = ['Youth', 'YoungAdult', 'MiddleAge', 'Senior']
new_cats = pd.cut(ages, bins,labels=bin_names)

pd.Series(new_cats).value_counts()

In [None]:
#we can also calculate their cumulative sum
pd.Series(new_cats).value_counts().cumsum()

Let's learn about grouping data and creating pivots in pandas. It's an immensely important data analysis method which you'd probably have to use on every data set you work with.

In [None]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

In [None]:
#calculate the mean of data1 column by key1
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
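
Grouping also works with more than one key and more than one aggregation. The sketch below uses fixed numbers in place of the random values above (an assumption made so the results are easy to verify by hand).

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# group by two keys: the result is indexed by (key1, key2) pairs
means = df.groupby(['key1', 'key2'])['data1'].mean()
print(means.loc[('a', 'one')])    # 3.0, the mean of 1.0 and 5.0

# compute several aggregations per group in one call
summary = df.groupby('key1')['data1'].agg(['mean', 'count'])
print(summary.loc['b', 'mean'], summary.loc['b', 'count'])    # 3.5 2
```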

Let's see how to slice the data frame.

In [None]:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

In [None]:
#get first n rows from the data frame
df[:3]

In [None]:
#slice based on date range
df['20130101':'20130104']

In [None]:
#slicing based on column names
df.loc[:,['A','B']]

In [None]:
#slicing based on both row index labels and column names
df.loc['20130102':'20130103',['A','B']]

In [None]:
#selecting based on integer position
df.iloc[3] #returns the 4th row (position 3)

In [None]:
#returns a specific range of rows and columns
df.iloc[2:4, 0:2]

In [None]:
#returns specific rows and columns using lists containing columns or row indexes
df.iloc[[1,5],[0,2]] 
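
A common source of confusion is that `loc` and `iloc` treat slice endpoints differently. This sketch uses a data frame of sequential integers instead of random values (an assumption for reproducibility) and selects the same block both ways.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# loc is label-based: both endpoints of the slice are included
by_label = df.loc['20130102':'20130103', ['A', 'B']]

# iloc is position-based: the end position is excluded, as with Python lists
by_position = df.iloc[1:3, 0:2]

print(by_label.equals(by_position))    # True
```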

Similarly, we can do Boolean indexing based on column values as well. This helps in filtering a data set based on a pre-defined condition.

In [None]:
df[df.A > 0.5]

In [None]:
#we can copy the data set
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2

In [None]:
#select rows based on column values
df2[df2['E'].isin(['two','four'])]

In [None]:
#select all rows except those with two and four
df2[~df2['E'].isin(['two','four'])]

We can also use the query method to select rows based on a criterion.

In [None]:
#list all rows where A is greater than C
df.query('A > C')

In [None]:
#using OR condition
df.query('A < B | C > A')
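
For comparison, the same OR filter can be written with Boolean indexing. The sketch below uses a small made-up data frame; note that the Boolean-mask version needs parentheses around each comparison.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3], 'B': [2, 4, 3], 'C': [0, 6, 3]})

# query version: the condition is written as a string
from_query = df.query('A < B or C > A')

# Boolean-mask version: wrap each comparison in parentheses
from_mask = df[(df['A'] < df['B']) | (df['C'] > df['A'])]

print(from_query.index.tolist())       # [0, 1]
print(from_query.equals(from_mask))    # True
```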

Pivot tables are extremely useful in analyzing data using a customized tabular format.

In [None]:
#create a data frame
data = pd.DataFrame({'group': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
#calculate means of each group
data.pivot_table(values='ounces',index='group',aggfunc='mean')

In [None]:
#calculate count by each group
data.pivot_table(values='ounces',index='group',aggfunc='count')
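
A pivot table with a single index column is closely related to a groupby. As a quick check, the sketch below computes the group means both ways and confirms they agree.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# pivot_table returns a data frame with one column per aggregated variable
pivot = data.pivot_table(values='ounces', index='group', aggfunc='mean')

# groupby on the same key returns a series with the same values
grouped = data.groupby('group')['ounces'].mean()

print(np.allclose(pivot['ounces'], grouped))    # True
```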

Up till now, we've become familiar with the basics of the pandas library using toy examples. Now, we'll take up a real-life data set and use our newly gained knowledge to explore it.


## Exploring an ML Data Set

We'll work with the popular adult data set. The data set has been taken from the UCI Machine Learning Repository. In this data set, the dependent variable is "target." It is a binary classification problem. We need to predict whether the salary of a given person is less than or more than 50K.

In [None]:
#load the data
train  = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

In [None]:
#check data set
train.info()

In [None]:
test.info()

In [None]:
print ("The train data has",train.shape)
print ("The test data has",test.shape)

In [None]:
#Let's have a glimpse of the data set
train.head()

Now, let's check the missing values (if present) in this data.

In [None]:
nans = train.shape[0] - train.dropna().shape[0]
print ("%d rows have missing values in the train data" %nans)

nand = test.shape[0] - test.dropna().shape[0]
print ("%d rows have missing values in the test data" %nand)

We should be more curious to know which columns have missing values.

In [None]:
#only 3 columns have missing values
train.isnull().sum()

Let's count the number of unique values from character variables.

In [None]:
cat = train.select_dtypes(include=['O'])
cat.apply(pd.Series.nunique)

Since the missing values are found in 3 character variables (workclass, occupation, and native.country), let's impute them with their respective modes.

In [None]:
#Workclass
train.workclass.value_counts(sort=True)
train['workclass'] = train['workclass'].fillna('Private')


#Occupation
train.occupation.value_counts(sort=True)
train['occupation'] = train['occupation'].fillna('Prof-specialty')


#Native Country
train['native.country'].value_counts(sort=True)
train['native.country'] = train['native.country'].fillna('United-States')
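
Instead of reading the most frequent value off `value_counts()` and typing it by hand, the mode can be computed directly. The sketch below uses a small made-up data frame (not the adult data set) to show the pattern.

```python
import numpy as np
import pandas as pd

# toy data standing in for the real training set
toy = pd.DataFrame({'workclass': ['Private', np.nan, 'Private', 'State-gov'],
                    'occupation': ['Sales', 'Sales', np.nan, 'Tech-support']})

for col in ['workclass', 'occupation']:
    # mode() returns a series because ties are possible; take the first value
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy.isnull().sum().sum())    # 0
```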

Let's check again if there are any missing values left.

In [None]:
train.isnull().sum()

Now, we'll check the target variable to investigate if this data is imbalanced or not. 

In [None]:
#check proportion of target variable
train.target.value_counts()/train.shape[0]

Let's create a cross tab of the target variable with education. With this, we'll try to understand the influence of education on the target variable.

In [None]:
pd.crosstab(train.education, train.target,margins=True)/train.shape[0]

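Next, we'll convert the character variables to numeric using label encoding, which assigns an integer to each unique value of a variable. For example, say a color variable has four values: red, green, blue, and pink. Label encoding this variable with scikit-learn's `LabelEncoder`, which numbers the values in sorted order, will return output as: blue = 0, green = 1, pink = 2, red = 3.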

In [None]:
#load sklearn and encode all object type variables
from sklearn import preprocessing

for x in train.columns:
    if train[x].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[x].values))
        train[x] = lbl.transform(list(train[x].values))
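
As a small check of how `LabelEncoder` numbers the categories, the sketch below encodes a made-up color variable. The classes are stored in sorted order, and each value is replaced by its position in that order.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'green', 'blue', 'pink']

lbl = LabelEncoder()
codes = lbl.fit_transform(colors)

print(lbl.classes_.tolist())    # ['blue', 'green', 'pink', 'red']
print(codes.tolist())           # [3, 1, 0, 2]

# inverse_transform maps the integers back to the original labels
print(lbl.inverse_transform(codes).tolist())    # ['red', 'green', 'blue', 'pink']
```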

Let's check the changes applied to the data set.

In [None]:
train.head()

As we can see, all the variables have been converted to numeric, including the target variable. 

In [None]:
#<50K = 0 and >50K = 1
train.target.value_counts()

## Building a Random Forest Model

Let's create a random forest model and check the model's accuracy.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

y = train['target']
del train['target']

X = train
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#train the RF classifier
clf = RandomForestClassifier(n_estimators = 500, max_depth = 6)
clf.fit(X_train,y_train)
clf.predict(X_test)

Now, let's make predictions on the test set and check the model's accuracy.

In [None]:
#make prediction and check model's accuracy
prediction = clf.predict(X_test)
acc =  accuracy_score(np.array(y_test),prediction)
print ('The accuracy of Random Forest is {}'.format(acc))

**Hurrah! Our learning algorithm gave 85% accuracy.**
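
A single train/test split gives only one estimate of accuracy. As a rough sketch of a more stable check, the code below runs 5-fold cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` and a smaller forest (both are assumptions so that the example runs on its own); the same call should work with the `X` and `y` created above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic binary classification data standing in for the adult data set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

clf_demo = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1)

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```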

### Summary

This kernel is meant to help anyone who's starting with Python to get a taste of data manipulation and a little bit of machine learning using Python.

To dive deeper into pandas, check its documentation and start exploring. If you get stuck anywhere, you can drop your questions or suggestions in the comments below. Hope you found this kernel useful.