Copyright 2021, Zaka AI, Inc. All Rights Reserved.

# Introduction to Pandas
---

**Objective:**

In this exercise, you will cover the basics of the Pandas library, which provides high-performance, easy-to-use data structures for data analysis.
The name is often expanded as “Python Data Analysis Library”. Pandas is one of the most widely used tools for data munging and wrangling in Python, and it is open source and free to use under a BSD license.

This material is adapted from [this article](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673) and from the [10 minutes to pandas guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

In [5]:
#importing Pandas library
import pandas as pd
import numpy as np

## Object Creation

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).

Creating a Series by passing a dict: the keys become the index labels, and each value (here an entire list) is stored as a single element:

In [6]:
s = pd.Series(dict(row1=[1, 3, 5, np.nan, 6, 8, 'boy'],row2=[12,32,1233,1234,'man',12,3]))
print(s)
print(s.shape)

row1           [1, 3, 5, nan, 6, 8, boy]
row2    [12, 32, 1233, 1234, man, 12, 3]
dtype: object
(2,)
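
For comparison, here is a minimal sketch of the list-based form. Passing a flat list gives pandas no labels to use, so it falls back to a default integer index. The name `s_list` is an illustrative assumption and is not part of the exercise.

```python
import numpy as np
import pandas as pd

# A flat list of values: pandas assigns the default integer index 0..5
s_list = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_list.index)  # RangeIndex(start=0, stop=6, step=1)
print(s_list.dtype)  # float64, because NaN forces a floating-point dtype
print(s_list.shape)  # (6,)
```

Each list element becomes its own row here, whereas the dict-of-lists call above stores each whole list as one element.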


DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

Before building a DataFrame from a NumPy array with a datetime index and labeled columns, we create the datetime index itself with `date_range`:


In [7]:
dates = pd.date_range('20130101', periods=6)
print(dates)


DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
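
With the index in hand, the DataFrame itself could be built roughly as follows. This is a sketch under assumptions: the column names `A` to `D`, the random values, and the name `df_dates` are illustrative and do not come from the exercise.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# 6 rows x 4 columns of random values; the seed only makes the sketch repeatable
rng = np.random.default_rng(0)
values = rng.standard_normal((6, 4))

# The NumPy array supplies the data, `dates` the row labels, `columns` the column labels
df_dates = pd.DataFrame(values, index=dates, columns=list('ABCD'))

print(df_dates.shape)  # (6, 4)
print(df_dates.head(2))
```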


### Import our Data

We are looking at a dataset with the following columns:
  - GRE Scores ( out of 340 )
  - TOEFL Scores ( out of 120 )
  - University Rating ( out of 5 )
  - Statement of Purpose ( out of 5 )
  - Letter of Recommendation Strength ( out of 5 )
  - Undergraduate GPA ( out of 10 )
  - Research Experience ( either 'yes' or 'no' )

The data is stored in a CSV (comma-separated values) file. To load it, we use the pandas `read_csv` function.
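
As a rough illustration of what `read_csv` does, the sketch below parses a few rows from an in-memory string instead of a file. The three rows are copied from the output shown later in this exercise, restricted to four columns; `csv_text` and `sample` are hypothetical names.

```python
import io
import pandas as pd

# Three rows taken from the dataset preview, trimmed to four columns
csv_text = """Serial No.,GRE Score,CGPA,Research
1,337,9.65,yes
2,324,8.87,yes
5,314,8.21,no
"""

# read_csv accepts a path or any file-like object; StringIO stands in for the file here
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 4)
print(sample.dtypes)  # column types are inferred from the text
```

The first line of the text is used as the header, and each column's dtype is inferred from its values.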

### Path To Data:

We need to specify the path to our dataset on Google Drive:

1.   Access the file in the following link and add it to your drive
[here](https://drive.google.com/open?id=1x9ipy9mtI0AniLSmMaKsfHGwDeoxq60u)

2.   Navigate to the drive file by mounting drive and entering your authorization code
```python
from google.colab import drive
drive.mount('/content/drive')
```

3.   Find the path to the dataset folder
![path_to_data](https://drive.google.com/uc?id=1QHRBoqGLaiOpDp9qFm0UF5wjgkZBgsBC)

4. Right-click on the folder Pre-bootcamp Exercises and click "Copy path"

5. Finally, paste the copied path into the following line of code. Remember to keep the leading and trailing forward slashes
```python
Path_to_data = '/ ...copied path... /'
```
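
The trailing slash matters because the file name is later appended to the path by plain string concatenation. A minimal sketch of the difference, using `os.path.join` as an alternative (the lowercase variable names are illustrative):

```python
import os

path_to_data = '/content/drive/My Drive/'  # folder path with its trailing slash
file_name = 'admission_dataset.csv'

# Plain concatenation is only correct when the trailing slash is present
concatenated = path_to_data + file_name

# os.path.join adds a separator only when one is missing
joined = os.path.join(path_to_data, file_name)

print(concatenated)
print(concatenated == joined)  # True
```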

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
Path_to_data = '/content/drive/My Drive/'

In [10]:
df = pd.read_csv(Path_to_data + "admission_dataset.csv")
df

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,yes,0.92
1,2,324,107,4,4.0,4.5,8.87,yes,0.76
2,3,316,104,3,3.0,3.5,8.00,yes,0.72
3,4,322,110,3,3.5,2.5,8.67,yes,0.80
4,5,314,103,2,2.0,3.0,8.21,no,0.65
...,...,...,...,...,...,...,...,...,...
495,496,332,108,5,4.5,4.0,9.02,yes,0.87
496,497,337,117,5,5.0,5.0,9.87,yes,0.96
497,498,330,120,5,4.5,5.0,9.56,yes,0.93
498,499,312,103,4,4.0,5.0,8.43,no,0.73


Notice the columns of the resulting DataFrame have different dtypes.

In [11]:

print(df.shape)
print(df.dtypes)

(500, 9)
Serial No.             int64
GRE Score              int64
TOEFL Score            int64
University Rating      int64
SOP                  float64
LOR                  float64
CGPA                 float64
Research              object
Chance of Admit      float64
dtype: object


## Viewing Data

Here is how to view the top and bottom rows of the frame:

In [12]:
df.head() # view first 5 rows of dataframe            

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,yes,0.92
1,2,324,107,4,4.0,4.5,8.87,yes,0.76
2,3,316,104,3,3.0,3.5,8.0,yes,0.72
3,4,322,110,3,3.5,2.5,8.67,yes,0.8
4,5,314,103,2,2.0,3.0,8.21,no,0.65


In [13]:
df.head(3) # view first 3 rows of dataframe

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,yes,0.92
1,2,324,107,4,4.0,4.5,8.87,yes,0.76
2,3,316,104,3,3.0,3.5,8.0,yes,0.72


In [14]:
print(df.tail(2)) # view last 2 rows
df.loc[1:3] # view rows with index labels 1 through 3 (the end label is included)

     Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  \
498         499        312          103                  4  4.0  5.0  8.43   
499         500        327          113                  4  4.5  4.5  9.04   

    Research  Chance of Admit  
498       no             0.73  
499       no             0.84  


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,2,324,107,4,4.0,4.5,8.87,yes,0.76
2,3,316,104,3,3.0,3.5,8.0,yes,0.72
3,4,322,110,3,3.5,2.5,8.67,yes,0.8
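
The cell above returns three rows for `loc[1:3]` because label slices include their end label. A small sketch on a hypothetical toy frame (`toy` and its `score` column are assumptions for illustration) contrasts this with position-based slicing:

```python
import pandas as pd

# Hypothetical toy frame with a default integer index 0..5
toy = pd.DataFrame({'score': [10, 20, 30, 40, 50, 60]})

first_two = toy.head(2)      # rows labeled 0 and 1
last_two = toy.tail(2)       # rows labeled 4 and 5
by_label = toy.loc[1:3]      # label slice: the end label 3 is included
by_position = toy.iloc[1:3]  # position slice: the end position 3 is excluded

print(len(by_label), len(by_position))  # 3 2
```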


Display the index:

In [15]:
df.index

RangeIndex(start=0, stop=500, step=1)

`describe()` shows a quick statistical summary of your data. It is a method, so it must be called with parentheses:

In [17]:
df.describe()
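
As a hedged illustration of what calling `describe()` returns, here is a minimal sketch on a hypothetical toy frame (`toy`, `gre`, and `research` are illustrative names, not the exercise data):

```python
import pandas as pd

toy = pd.DataFrame({'gre': [300, 320, 340], 'research': ['no', 'yes', 'yes']})

# Called with parentheses, describe() returns a DataFrame of statistics
summary = toy.describe()
print(summary)

# By default only numeric columns are summarized in a mixed-type frame
print(list(summary.columns))  # ['gre']

# Without parentheses you only get the method object, not the statistics
print(callable(toy.describe))  # True
```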

Sorting by values:

In [18]:
# sort by ascending order of Chance of Admit
df.sort_values(by='Chance of Admit', ascending=True) # Notice: returns a new DataFrame; df itself is unchanged
df.sort_index(ascending=False) # sort by index label, descending; only this last expression is displayed

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
499,500,327,113,4,4.5,4.5,9.04,no,0.84
498,499,312,103,4,4.0,5.0,8.43,no,0.73
497,498,330,120,5,4.5,5.0,9.56,yes,0.93
496,497,337,117,5,5.0,5.0,9.87,yes,0.96
495,496,332,108,5,4.5,4.0,9.02,yes,0.87
...,...,...,...,...,...,...,...,...,...
4,5,314,103,2,2.0,3.0,8.21,no,0.65
3,4,322,110,3,3.5,2.5,8.67,yes,0.80
2,3,316,104,3,3.0,3.5,8.00,yes,0.72
1,2,324,107,4,4.0,4.5,8.87,yes,0.76


Notice that a sort only persists if its result is assigned back to the DataFrame:

In [19]:
print(df)
df=df.sort_values(by='Chance of Admit',ascending=False) # should assign changes to dataframe
print(df)

     Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  \
0             1        337          118                  4  4.5  4.5  9.65   
1             2        324          107                  4  4.0  4.5  8.87   
2             3        316          104                  3  3.0  3.5  8.00   
3             4        322          110                  3  3.5  2.5  8.67   
4             5        314          103                  2  2.0  3.0  8.21   
..          ...        ...          ...                ...  ...  ...   ...   
495         496        332          108                  5  4.5  4.0  9.02   
496         497        337          117                  5  5.0  5.0  9.87   
497         498        330          120                  5  4.5  5.0  9.56   
498         499        312          103                  4  4.0  5.0  8.43   
499         500        327          113                  4  4.5  4.5  9.04   

    Research  Chance of Admit  
0        yes             0.92  
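
The same behavior can be reproduced on a small scale. In this sketch, `toy` and its `chance` column are hypothetical stand-ins for the exercise data:

```python
import pandas as pd

toy = pd.DataFrame({'chance': [0.72, 0.92, 0.65]})

# sort_values returns a new DataFrame and leaves `toy` untouched
sorted_copy = toy.sort_values(by='chance', ascending=False)

print(list(toy.index))          # [0, 1, 2]  original order is unchanged
print(list(sorted_copy.index))  # [1, 0, 2]  rows keep their original labels
```

Because rows keep their labels after sorting, the index of a sorted frame is no longer in ascending order.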

## Selection

In [20]:
df['Research'] # selecting column by name

202    yes
143    yes
24     yes
203    yes
71     yes
      ... 
457     no
94      no
58     yes
92      no
376     no
Name: Research, Length: 500, dtype: object

In [21]:
df[0:2] # selecting the first two rows by position

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
202,203,340,120,5,4.5,4.5,9.91,yes,0.97
143,144,340,120,4,4.5,4.0,9.92,yes,0.97


***loc*** gets rows (or columns) with particular labels from the index. 

***iloc*** gets rows (or columns) at particular positions in the index (so it only takes integers).

In [22]:
print(df.loc[4,['GRE Score', 'CGPA','Chance of Admit']]) # selecting specific columns from the row with index label 4
print(df.iloc[24:5]) # empty result: the start position 24 is past the stop position 5

GRE Score           314
CGPA               8.21
Chance of Admit    0.65
Name: 4, dtype: object
Empty DataFrame
Columns: [Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research, Chance of Admit]
Index: []
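
Since `df` was sorted earlier, labels and positions no longer coincide, which is where `loc` and `iloc` diverge. A minimal sketch on a hypothetical toy frame (`toy` and `chance` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'chance': [0.72, 0.92, 0.65]})
toy = toy.sort_values(by='chance', ascending=False)  # index order is now 1, 0, 2

print(toy.loc[0, 'chance'])  # 0.72: the row whose label is 0
print(toy.iloc[0, 0])        # 0.92: the row in the first position
print(len(toy.iloc[2:1]))    # 0: a slice whose start is past its stop is empty
```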


In [23]:
df.iloc[0:2] # selecting the first two rows by position

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
202,203,340,120,5,4.5,4.5,9.91,yes,0.97
143,144,340,120,4,4.5,4.0,9.92,yes,0.97


In [24]:
df.iloc[3:5, 1:] # selection by position: rows 3 and 4, all columns from the second onward

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
203,334,120,5,4.0,5.0,9.87,yes,0.97
71,336,112,5,5.0,5.0,9.76,yes,0.96


In [25]:
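# Only the last expression in a cell is displayed, so the output below comes from the isin() filter
df.iloc[0:, 8] # all rows of the column at position 8 (Chance of Admit)
df[df['Chance of Admit'] > 0.5] # rows where Chance of Admit is greater than 50%
df[df['CGPA'] > 8.50] # rows where CGPA is above 8.50
df[df['TOEFL Score'].isin([112, 120])] # rows where TOEFL Score is 112 or 120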

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
202,203,340,120,5,4.5,4.5,9.91,yes,0.97
143,144,340,120,4,4.5,4.0,9.92,yes,0.97
203,204,334,120,5,4.0,5.0,9.87,yes,0.97
71,72,336,112,5,5.0,5.0,9.76,yes,0.96
81,82,340,120,4,5.0,5.0,9.5,yes,0.96
212,213,338,120,4,5.0,5.0,9.66,yes,0.95
34,35,331,112,5,4.0,5.0,9.8,yes,0.94
284,285,340,112,4,5.0,4.5,9.66,yes,0.94
25,26,340,120,5,4.5,4.5,9.6,yes,0.94
497,498,330,120,5,4.5,5.0,9.56,yes,0.93


In [26]:
# Using the isin() method for filtering:

# select only the rows where University Rating is 1 or 2
df[df['University Rating'].isin([1,2])]

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
141,142,332,118,2,4.5,3.5,9.36,yes,0.90
140,141,329,110,2,4.0,3.0,9.15,yes,0.84
145,146,320,113,2,2.0,2.5,8.64,yes,0.81
359,360,321,107,2,2.0,1.5,8.44,no,0.81
448,449,312,109,2,2.5,4.0,9.02,no,0.80
...,...,...,...,...,...,...,...,...,...
375,376,304,101,2,2.0,2.5,7.66,no,0.38
457,458,295,99,1,2.0,1.5,7.57,no,0.37
58,59,300,99,1,3.0,2.0,6.80,yes,0.36
92,93,298,98,2,4.0,3.0,8.03,no,0.34
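
The filters above all work by building a boolean Series and using it to index the frame. A sketch on hypothetical toy data (`toy`, `rating`, and `chance` are assumed names) shows the mask explicitly and how two conditions can be combined:

```python
import pandas as pd

toy = pd.DataFrame({
    'rating': [1, 2, 4, 5],
    'chance': [0.36, 0.81, 0.76, 0.97],
})

# isin() returns a boolean Series with one entry per row
mask = toy['rating'].isin([1, 2])
low_rated = toy[mask]

# Conditions combine with & (and), | (or), ~ (not); each comparison needs its own parentheses
low_rated_likely = toy[mask & (toy['chance'] > 0.5)]

print(mask.tolist())                    # [True, True, False, False]
print(low_rated_likely.index.tolist())  # [1]
```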


## Operations 

### Statistics

In [27]:
df.mean(numeric_only=True) # average of each numeric column (the text column Research is skipped)

Serial No.           250.500000
GRE Score            316.472000
TOEFL Score          107.192000
University Rating      3.114000
SOP                    3.374000
LOR                    3.487952
CGPA                   8.576440
Chance of Admit        0.721740
dtype: float64

In [28]:
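# Performing average over rows (not displayed, because it is not the last expression in the cell)
df.mean(axis=1, numeric_only=True)
df.info() # summary of column dtypes and non-null counts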

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 202 to 376
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                498 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    object 
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(4), object(1)
memory usage: 55.2+ KB
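
The `info()` output shows that LOR has only 498 non-null values. As a hedged sketch of how `mean` treats such gaps and non-numeric columns, consider this hypothetical toy frame (all names are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sop': [4.0, 3.0],
    'lor': [4.5, np.nan],       # one missing value, like the LOR column above
    'research': ['yes', 'no'],  # non-numeric column
})

col_means = toy.mean(numeric_only=True)          # one value per numeric column
row_means = toy.mean(axis=1, numeric_only=True)  # one value per row

print(col_means['lor'])    # 4.5: NaN is skipped by default
print(row_means.tolist())  # [4.25, 3.0]
print(toy['lor'].count())  # 1: count() reports non-null values only
```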


  
