# Introduction to Numpy and Pandas

## Index
### - [Numpy](#numpy)
### - [Pandas](#pandas)

<a id='numpy'></a>
## Numpy
Numpy is a Python library whose core array data structure is designed to make mathematical computations faster.

##### Array declaration

A regular Python list can be written as,
```
import numpy as np

arr = [1, 2, 3, 4, 5, 6]
# We can convert that to a numpy array as,
np_arr = np.array(arr)
```
Nothing special so far?
Read on!

Numpy arrays are built for numerical matrix operations and get their speed from [vectorisation](https://en.wikipedia.org/wiki/Vectorization): an operation is applied to the whole array at once in compiled code instead of element by element in a Python loop. Looping over a large array in Python can take a lot of time, so vectorised operations are a lot faster. This is useful in scientific computing, machine learning and data processing.
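
To make the idea concrete, here is a minimal sketch (the variable names are only for illustration) of how the same `2 * ...` expression behaves on a plain list and on a numpy array:

```python
import numpy as np

values = [1, 2, 3]

# Multiplying a plain list by 2 repeats the list; it does not double the elements.
repeated = 2 * values

# Doubling a list element by element needs a loop or a comprehension.
doubled_list = [2 * v for v in values]

# A numpy array applies the multiplication to every element at once.
doubled_array = 2 * np.array(values)

print(repeated)       # [1, 2, 3, 1, 2, 3]
print(doubled_list)   # [2, 4, 6]
print(doubled_array)  # [2 4 6]
```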

In [1]:
# import the numpy library as follows to get started.
import numpy as np

#### Speed
As mentioned, speed is a major advantage of a numpy array. Let us verify this by doubling every value in a plain Python list with a loop and in a numpy array with a single vector operation.

In [20]:
# we define a list which contains the numbers from 1 to 10000000
arr = []
for i in range(1, 10000001):
    arr.append(i)

In [17]:
# we declare the numpy array for the same.
numpy_arr = np.array(arr)

In [18]:
# We are going to double the values in both arrays and time both operations.
import time

In [29]:
# For a normal list. Note that, as written, the loop below covers only the first 1000000 of the 10000000 elements.

start_time = time.time()

for i in range(0, 1000000):
    arr[i] = 2*arr[i]

total_time_norm = time.time()-start_time
print('Ending time = {}'.format(total_time_norm))

Ending time = 0.11598682403564453


In [30]:
# For a numpy array.

start_time = time.time()

numpy_arr = 2*numpy_arr

total_time_np = time.time()-start_time
print('Ending time = {}'.format(total_time_np))

Ending time = 0.033846378326416016


In [31]:
total_time_norm/total_time_np

3.4268607092038716

##### Result
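As we can see, the vectorised operation took about 3.4 times less time than the loop, even though the loop as written covered only the first 1000000 elements while numpy doubled all 10000000. So it makes sense to process numerical data using numpy instead of a regular list.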
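
If you want to repeat the measurement yourself, the following is a rough sketch of a like-for-like comparison in which both approaches process the same number of elements. The size `n` is an assumption chosen so the cell runs quickly, and `time.perf_counter()` is used here in place of `time.time()` because it is intended for measuring short durations. The timings will differ from machine to machine, so no expected numbers are shown.

```python
import time
import numpy as np

n = 1_000_000  # assumed size, smaller than the notebook's list so this runs quickly

values = list(range(1, n + 1))
np_values = np.array(values)

# Time a Python-level loop (a list comprehension) over all n elements.
start = time.perf_counter()
doubled_list = [2 * v for v in values]
loop_seconds = time.perf_counter() - start

# Time the vectorised numpy operation over the same n elements.
start = time.perf_counter()
doubled_array = 2 * np_values
numpy_seconds = time.perf_counter() - start

print('loop: {:.4f}s, numpy: {:.4f}s'.format(loop_seconds, numpy_seconds))
```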

#### Other features

Besides vector math, numpy has built-in functions to compute statistical values like mean, median, sum, etc.

In [34]:
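# Computing the mean. The value below is 32 times the original mean of 5000000.5, so the doubling cell had been run five times in this session.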
numpy_arr.mean()

160000016.0
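
The other statistics work the same way. A small sketch on a made-up five-element array (the values are chosen only so the results are easy to check by hand):

```python
import numpy as np

sample = np.array([1, 2, 3, 4, 10])

total = sample.sum()         # 1 + 2 + 3 + 4 + 10
mean = sample.mean()         # total divided by the number of elements
median = np.median(sample)   # middle value of the sorted array
largest = sample.max()

print(total, mean, median, largest)  # 20 4.0 3.0 10
```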

You can read more in the [numpy documentation](https://docs.scipy.org/doc/). If that looks too overwhelming, you can refer to the free Udacity course I mentioned earlier, which gives a very detailed overview of numpy. The rest comes with practice.

<a id='pandas'></a>
## Pandas

Numpy made computation faster. Can life get any easier? Yes, it can, and that is where pandas comes into play.
Pandas provides a data structure called the DataFrame for storing spreadsheet-style data. It is built on top of numpy, so column-wise operations are similarly fast, and it offers much more functionality with an easy syntax. How cool is that? Let's dive in.

#### Loading in CSVs and making our lives easier.

The last notebook was all about opening and reading the document and processing it with loops after loops after loops!
In this notebook, we will show you how you can do that in a single line.

In [35]:
#import the pandas library.

import pandas as pd

In [36]:
# just like arrays in numpy are called numpy arrays, tables in pandas are called DataFrames and are often denoted as df.

loan_data_df = pd.read_csv('Data/test.csv')

#### Voila! it's done!!

In [49]:
# This is how you print the first 3 rows of the loaded data in pandas. The default count is 5.
loan_data_df.head(3)

|   | Loan_ID | Gender | Married | Education | ApplicantIncome | LoanAmount | Property_Area |
|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | Graduate | 5720 | 110 | Urban |
| 1 | LP001022 | Male | Yes | Graduate | 3076 | 126 | Urban |
| 2 | LP001031 | Male | Yes | Graduate | 5000 | 208 | Urban |
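
If you don't have the `Data/test.csv` file at hand, you can still try `read_csv`. The sketch below assumes the file is laid out like the three rows printed by `head(3)`, types them out as CSV text, and reads them from an in-memory buffer instead of a path. The name `sample_df` is only for illustration.

```python
import io
import pandas as pd

# The three rows printed by head(3), written out as CSV text.
csv_text = """Loan_ID,Gender,Married,Education,ApplicantIncome,LoanAmount,Property_Area
LP001015,Male,Yes,Graduate,5720,110,Urban
LP001022,Male,Yes,Graduate,3076,126,Urban
LP001031,Male,Yes,Graduate,5000,208,Urban
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)  # (3, 7)
print(sample_df.head())
```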


#### A spreadsheet for Python
Yes! And I promise you, pandas will make our lives a lot easier. Almost everything you can do in a spreadsheet can be done in a line or two of pandas. Let's look into the functionality.

#### Accessing elements in pandas by column names.

Just as a value in a dictionary is accessed by its key, a column in pandas can be accessed by specifying the column name as follows.
```
df[column_name]
```

In [43]:
# Getting the values of a column by name.
loan_data_df['Loan_ID'].head()

0    LP001015
1    LP001022
2    LP001031
3    LP001035
4    LP001051
Name: Loan_ID, dtype: object
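
A single name gives back one column, and a list of names gives back several. A small sketch on a three-row frame built by hand from the values shown earlier (the frame name `sample_df` is only for illustration):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Loan_ID': ['LP001015', 'LP001022', 'LP001031'],
    'ApplicantIncome': [5720, 3076, 5000],
    'LoanAmount': [110, 126, 208],
})

# A single column name returns a pandas Series.
incomes = sample_df['ApplicantIncome']

# A list of column names returns a smaller DataFrame.
subset = sample_df[['Loan_ID', 'LoanAmount']]

print(type(incomes).__name__)  # Series
print(list(subset.columns))    # ['Loan_ID', 'LoanAmount']
```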

#### Accessing elements in pandas (Indexing)

The values in a pandas dataframe can be accessed by position using the iloc attribute as follows,
```
df.iloc[row, column]
```
For the row and column you can specify slices like you do for a general Python list, or you can select the entire range by using ':' instead of a value.

In [39]:
# Eg: for selecting the first five rows of the first column
loan_data_df.iloc[0:5, 0]

0    LP001015
1    LP001022
2    LP001031
3    LP001035
4    LP001051
Name: Loan_ID, dtype: object
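
A few more position-based selections, sketched on a small hand-built frame (`sample_df` and its three columns are only for illustration):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Loan_ID': ['LP001015', 'LP001022', 'LP001031'],
    'ApplicantIncome': [5720, 3076, 5000],
    'LoanAmount': [110, 126, 208],
})

# One row position and one column position give a single value.
first_cell = sample_df.iloc[0, 0]

# A slice of rows with ':' for the columns gives those rows with every column.
first_two_rows = sample_df.iloc[0:2, :]

# ':' for the rows and -1 for the column gives every row of the last column.
last_column = sample_df.iloc[:, -1]

print(first_cell)             # LP001015
print(first_two_rows.shape)   # (2, 3)
print(last_column.tolist())   # [110, 126, 208]
```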

#### Similarities to numpy array.

A pandas dataframe can be used for doing computations as well, in a similar fashion to numpy. 


In [50]:
# finding the mean of applicant incomes.
loan_data_df['ApplicantIncome'].mean()

4729.0625

In [52]:
# vectorised operations like addition, subtraction, multiplication etc. also work on columns.
sum_loan = loan_data_df['ApplicantIncome'] + loan_data_df['LoanAmount']
sum_loan.head(3)

0    5830
1    3202
2    5208
dtype: int64

#### The most important feature, however, is 'groupby()'.
What would you do if you wanted to group all the rows of the dataframe by some particular category? Say you wanted to split our data into male applicants and female applicants. How would you do it? In pandas, reaching for loops to do this kind of computation is usually overcomplicating things.

Pandas has a built-in function to group rows by the values in a column and then do computations like sum, mean and count on each group.

```
### General syntax
df.groupby('column_name').count()
### This gives, for each category, the number of non-missing values in every other column.
```

In [54]:
# Example to get the count of male and female applicants.
loan_data_df.groupby('Gender').count()

| Gender | Loan_ID | Married | Education | ApplicantIncome | LoanAmount | Property_Area |
|---|---|---|---|---|---|---|
| Female | 2 | 2 | 2 | 2 | 2 | 2 |
| Male | 14 | 14 | 14 | 14 | 14 | 14 |


In [56]:
# Ex 2. To get the count of the type of Property Area.
loan_data_df.groupby('Property_Area').count()

| Property_Area | Loan_ID | Gender | Married | Education | ApplicantIncome | LoanAmount |
|---|---|---|---|---|---|---|
| Rural | 1 | 1 | 1 | 1 | 1 | 1 |
| Semiurban | 5 | 5 | 5 | 5 | 5 | 5 |
| Urban | 10 | 10 | 10 | 10 | 10 | 10 |
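
Counting is only one of the computations you can run per group. The sketch below shows mean and sum as well, on a tiny frame of made-up values (they are not taken from `Data/test.csv`) so the results are easy to verify by hand.

```python
import pandas as pd

# Made-up values for illustration; they are not taken from Data/test.csv.
sample_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'ApplicantIncome': [5000, 3000, 7000, 4000],
    'LoanAmount': [100, 120, 200, 80],
})

# Average income per gender.
mean_income = sample_df.groupby('Gender')['ApplicantIncome'].mean()

# Total loan amount per gender.
total_loan = sample_df.groupby('Gender')['LoanAmount'].sum()

# Number of rows per gender; size() returns one number per group.
group_sizes = sample_df.groupby('Gender').size()

print(mean_income.to_dict())  # {'Female': 3500.0, 'Male': 6000.0}
print(total_loan.to_dict())   # {'Female': 200, 'Male': 300}
print(group_sizes.to_dict())  # {'Female': 2, 'Male': 2}
```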


#### More functionality
If this does not suffice, pandas offers a method called apply(), with which you can define a custom function and apply it to the entire data without writing a loop yourself. Like I said, why loop when you can *vectorise*.
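
Just to give a taste, here is a minimal sketch of apply() on a single column. The helper `income_band` and its threshold of 5000 are hypothetical and exist only for this illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({'ApplicantIncome': [5720, 3076, 5000]})

def income_band(income):
    """Hypothetical helper: label an income as 'high' or 'low'."""
    return 'high' if income >= 5000 else 'low'

# apply() calls income_band once for every value in the column.
bands = sample_df['ApplicantIncome'].apply(income_band)

print(bands.tolist())  # ['high', 'low', 'high']
```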

However, I don't plan to confuse new readers with such advanced topics. I highly suggest that you take the online course on Udacity. It is a very good way to get going with data analysis using Python.

### To conclude.
Don't worry if everything sounds overwhelming at first. Patience is a virtue when it comes to data science, and with practice you can master this. In the next notebook, we will look at a data set and see how we can analyse and process it using these libraries.