
# Table of Contents:

* Pandas
    + Importing Data
    + Creating Test Object
    + Viewing Data
    + Selection
    + Data Cleaning
    + Filter, Sort & Group by
    + Iteration
    + Join, Merging
    + Statistics
    + Visualization
    + Exporting Data

# Pandas

Pandas is an open-source Python library that provides easy-to-use data structures and data analysis tools.

__DataFrame__ is a two-dimensional m × n table where
* m is the number of rows
* n is the number of columns

__Series__ is a one-dimensional labeled array of length m. Each column of a DataFrame is a pandas Series.

__NOTE__ 
* df - A pandas DataFrame object
* s - A pandas Series object
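
As a minimal sketch of the relationship described above, the snippet below builds a small DataFrame and pulls one column out as a Series. The names `df_demo` and `s_demo` and the toy values are hypothetical and are not part of the Iris data used later.

```python
import pandas as pd

# Hypothetical toy data: 3 rows (m = 3) and 2 columns (n = 2)
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df_demo.shape)            # (rows, columns)

# Selecting a single column returns a Series of length m
s_demo = df_demo["a"]
print(type(s_demo).__name__)
print(s_demo.shape)
```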

## 1. Importing Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
!git clone https://github.com/venky14/Machine-Learning-with-Iris-Dataset.git

Cloning into 'Machine-Learning-with-Iris-Dataset'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 43 (delta 8), reused 0 (delta 0), pack-reused 26
Unpacking objects: 100% (43/43), done.


In [3]:
df = pd.read_csv('/content/Machine-Learning-with-Iris-Dataset/Iris.csv') #read from csv
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
5,6,5.4,3.9,1.7,0.4,Iris-setosa
6,7,4.6,3.4,1.4,0.3,Iris-setosa
7,8,5.0,3.4,1.5,0.2,Iris-setosa
8,9,4.4,2.9,1.4,0.2,Iris-setosa
9,10,4.9,3.1,1.5,0.1,Iris-setosa


In [14]:
pd.read_csv('/content/sample.txt',sep = ' ')

Unnamed: 0,Sagar,Kumar,Agrawal
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12


Other ways of importing data depending on the file type.

* __pd.read_table(filename)__ - From a delimited text file (tab-separated by default, like TSV)
* __pd.read_excel(filename)__ - From an Excel file
* __pd.read_sql(query, connection_object)__ - Reads the result of a SQL query or a database table
* __pd.read_json(json_string)__ - Reads JSON data into a DataFrame (extracting tables into a list of DataFrames is what __pd.read_html__ does)
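
The readers listed above can be tried without any files on disk. The sketch below assumes small in-memory strings (`tsv_text` and `json_text` are made-up sample data) wrapped in `io.StringIO`, which the pandas readers accept as file-like objects.

```python
import io
import pandas as pd

# Hypothetical tab-separated text; read_table uses a tab separator by default
tsv_text = "name\tscore\nA\t1\nB\t2\n"
df_tsv = pd.read_table(io.StringIO(tsv_text))

# Hypothetical JSON records; each object becomes one row
json_text = '[{"name": "A", "score": 1}, {"name": "B", "score": 2}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_tsv)
print(df_json)
```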

## 2. Create Test Objects
* __pd.DataFrame(dict)__ - From a dict: keys become column names, values (lists) become the column data
* __pd.DataFrame(np.random.rand(20,5))__ - 5 columns and 20 rows of random floats
* __pd.Series(my_list)__ - Creates a series from an iterable my_list

In [11]:
d = {'Sagar' : [1,2,3,4],"Priya" : [10,20,30,40],"Batul" :[100,200,300,400]}
print(d)
pd.DataFrame(d)

{'Sagar': [1, 2, 3, 4], 'Priya': [10, 20, 30, 40], 'Batul': [100, 200, 300, 400]}


Unnamed: 0,Sagar,Priya,Batul
0,1,10,100
1,2,20,200
2,3,30,300
3,4,40,400


In [12]:
df_dict = pd.DataFrame(columns=['City','State'], data = [['Kolkata','West Bengal'], ['Bangalore','Karnataka']])
df_dict

Unnamed: 0,City,State
0,Kolkata,West Bengal
1,Bangalore,Karnataka
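
The other two constructors from the list at the top of this section can be sketched in the same way. The variable names `rng_frame`, `my_list`, and `s_from_list` are illustrative only.

```python
import numpy as np
import pandas as pd

# 20 rows and 5 columns of random floats in [0, 1); columns are labeled 0..4
rng_frame = pd.DataFrame(np.random.rand(20, 5))
print(rng_frame.shape)

# A Series built from a plain Python list gets a default integer index
my_list = [10, 20, 30]
s_from_list = pd.Series(my_list)
print(s_from_list)
```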


## 3. Viewing Data

* __df.head(n)__ - First n rows of the DataFrame (__df.tail(n)__ returns the last n rows)
* __df.shape__ - Number of rows and columns
* __df.info()__ - Index, column data types, and memory usage
* __df.describe()__ - Summary statistics for numerical columns
* __df.apply(pd.Series.value_counts)__ - Unique values and counts for all columns

__s.value_counts(dropna=False)__ - Unique values of a Series and their counts, with missing values counted as well
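
To illustrate what `dropna=False` changes, here is a small sketch on a hypothetical Series (`s_demo`) that contains one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a repeated value and one missing value
s_demo = pd.Series(["x", "y", "x", np.nan])

counts_default = s_demo.value_counts()               # NaN is left out
counts_with_nan = s_demo.value_counts(dropna=False)  # NaN gets its own row

print(counts_default)
print(counts_with_nan)
```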

In [15]:
print("df shape\n")
print(df.shape)
print("\n================")
print("df info\n")
df.info()

df shape

(150, 6)

df info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [16]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


## 4. Selection

* __df[col]__ or __df.col__ - Returns the column with label col as a Series
* __df[[col1, col2]]__ - Returns the columns as a new DataFrame
* __s.iloc[0]__ - Selection by integer position
* __s.loc[0]__ - Selection by index label
* __df.loc[:, :]__ and __df.iloc[:, :]__ - The first argument selects rows and the second selects columns

```
# Single selections using iloc and DataFrame
# Rows:
df.iloc[0] # first row of data frame
df.iloc[1] # second row of data frame
df.iloc[-1] # last row of data frame
# Columns:
df.iloc[:,0] # first column of data frame
df.iloc[:,1] # second column of data frame
df.iloc[:,-1] # last column of data frame
```

In [18]:
# list the column labels
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [22]:
df['SepalLengthCm']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: SepalLengthCm, Length: 150, dtype: float64

In [24]:
df[['Id','SepalLengthCm']]

Unnamed: 0,Id,SepalLengthCm
0,1,5.1
1,2,4.9
2,3,4.7
3,4,4.6
4,5,5.0
...,...,...
145,146,6.7
146,147,6.3
147,148,6.5
148,149,6.2


In [26]:
df.iloc[:5 ,:2]

Unnamed: 0,Id,SepalLengthCm
0,1,5.1
1,2,4.9
2,3,4.7
3,4,4.6
4,5,5.0


In [28]:
df.iloc[:,1] # second column of data frame (position 1)

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: SepalLengthCm, Length: 150, dtype: float64

In [29]:
# rows 0 to 4; all columns
df.loc[0:4,:] # : for columns is optional here since we are asking for all columns

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [32]:
df.iloc[0:4,:]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa


In [30]:
# rows 0 to 4; selective columns
df.loc[0:4,['Id','SepalWidthCm']]

Unnamed: 0,Id,SepalWidthCm
0,1,3.5
1,2,3.0
2,3,3.2
3,4,3.1
4,5,3.6


In [31]:
df.iloc[0:4,['Id','SepalWidthCm']] # raises IndexError: iloc expects integer positions, not column labels

IndexError: ignored

__NOTE__: 
* __loc__ selects by label, so columns are given by name, while __iloc__ selects by integer position, so columns are given by number
* A __loc__ slice includes its upper bound (0:4 returns rows 0 through 4), while an __iloc__ slice excludes it (0:4 returns rows 0 through 3)
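
The two notes above can be checked on a small hypothetical frame. `df_demo` and `df_idx` are made-up examples; `df_idx` uses a non-default index to show that labels and positions are different things.

```python
import pandas as pd

df_demo = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6],
                        "Width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9]})

by_label = df_demo.loc[0:4, ["Id", "Width"]]   # labels 0 through 4, upper bound included
by_position = df_demo.iloc[0:4, [0, 1]]        # positions 0 through 3, upper bound excluded
print(len(by_label), len(by_position))

# With a custom index, loc uses the labels and iloc uses the positions
df_idx = pd.DataFrame({"v": [1, 2, 3]}, index=[10, 20, 30])
print(df_idx.loc[10, "v"])    # row labeled 10
print(df_idx.iloc[0, 0])      # row at position 0 (the same row here)
```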

Also __NOTE__ the following:
1. For creating a new DataFrame using column names
```
df[[col1, col2]]
```
is the same as
```
df.loc[:,[col1, col2]]
```
2. For selecting the first n rows of the DataFrame
```
df[0:n]
```
is the same as
```
df.iloc[0:n, :]
```
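
Both equivalences can be verified with `DataFrame.equals`. This is a quick sketch on a hypothetical frame; `df_demo`, its column names, and `n` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})
n = 2

# Column selection: bracket syntax vs. loc with a list of labels
same_columns = df_demo[["col1", "col2"]].equals(df_demo.loc[:, ["col1", "col2"]])

# Row slicing: bracket slice vs. iloc slice (both exclude position n)
same_rows = df_demo[0:n].equals(df_demo.iloc[0:n, :])

print(same_columns, same_rows)
```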

## 5. Data Cleaning


* __df.drop([col1, col2, col3], inplace = True, axis=1)__ - Remove set of column(s)
* __df.columns = ['a','b','c']__ - Renames columns
* __df.isnull()__ - Checks for null Values, Returns Boolean DataFrame
* __df.isnull().any()__ - Returns one boolean per column: True if that column contains at least one null value
* __df.dropna()__ - Drops all rows that contain null values
* __df.dropna(axis=1)__ - Drops all columns that contain null values
* __df.fillna(x)__ - Replaces all null values with x
* __s.replace(1,'one')__ - Replaces all values equal to 1 with 'one'
* __s.replace([1,3], ['one','three'])__ - Replaces all 1 with 'one' and 3 with 'three'
* __df.rename(columns = lambda x: x + '_1')__ - Mass renaming of columns
* __df.rename(columns = {'old_name': 'new_name'})__ - Selective renaming
* __df.rename(index = lambda x: x + 1)__ - Mass renaming of index
* __df[new_col] = df.col1 + ', ' + df.col2__ - Concatenates two string columns (separated by ', ') into a new column of the same DataFrame
* __df.drop_duplicates(keep='last',inplace=True)__ - Remove duplicate rows
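
The Iris data used below has no missing values, so several of these calls have no visible effect on it. The sketch here uses a hypothetical frame, `raw`, with one null per column and one duplicated row, so that each operation changes something.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one null per column; rows 0 and 3 are duplicates
raw = pd.DataFrame({"col1": ["a", "b", None, "a"],
                    "col2": [1.0, np.nan, 3.0, 1.0]})

null_per_column = raw.isnull().sum()                    # count of nulls in each column
dropped_rows = raw.dropna()                             # keep only rows without nulls
filled = raw.fillna({"col1": "missing", "col2": 0.0})   # a different fill value per column
renamed = raw.rename(columns={"col1": "letter"})        # selective renaming
deduped = raw.drop_duplicates(keep="last")              # keeps the last copy of each duplicate
replaced = raw["col2"].replace(1.0, 100.0)              # value replacement on a Series

print(null_per_column)
print(deduped)
```

None of these calls modifies `raw`, because `inplace=True` is not passed; each one returns a new object.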

In [33]:
df_copy = df.copy() # independent copy: changes to df_copy do not affect df


In [34]:
print(hex(id(df)))
print(hex(id(df_copy)))

0x7fce8f13d4d0
0x7fce8ad11410


In [40]:
df_copy.drop(['SepalWidthCm'], inplace=True, axis = 1)
df_copy.head()

Unnamed: 0,Id,SepalLengthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,1.4,0.2,Iris-setosa
1,2,4.9,1.4,0.2,Iris-setosa
2,3,4.7,1.3,0.2,Iris-setosa
3,4,4.6,1.5,0.2,Iris-setosa
4,5,5.0,1.4,0.2,Iris-setosa


In [42]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [43]:
df_any_null = df.isnull().any()
df_any_null

Id               False
SepalLengthCm    False
SepalWidthCm     False
PetalLengthCm    False
PetalWidthCm     False
Species          False
dtype: bool

In [45]:
#help(df.dropna)

In [46]:
df.dropna(axis=0, inplace=True)
df_check_null = df.isnull().sum()
print(df.shape)
df_check_null

(150, 6)


Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [47]:
print(df.shape)

(150, 6)


In [49]:
df_new_cols_name = df.rename(columns = lambda x: (x.lower()))
df_new_cols_name.head()

Unnamed: 0,id,sepallengthcm,sepalwidthcm,petallengthcm,petalwidthcm,species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [50]:
len(df)

150

In [None]:
len_df_before = len(df_new_cols_name)
print('Before removing duplicates, the length of dataframe: {}'.format(len_df_before))

# keep : {‘first’, ‘last’, False}, default ‘first’
# *first : Drop duplicates except for the first occurrence.
# *last : Drop duplicates except for the last occurrence.
# *False : Drop all duplicates.
df_new_cols_name.drop_duplicates(keep='last',inplace=True)

# highlight_null takes `color` in pandas >= 1.5; older versions used `null_color`
display(df_new_cols_name.head(10).style.highlight_null(color='blue'))

len_df_after = len(df_new_cols_name)
print('After removing duplicates, the length of dataframe: {}'.format(len_df_after))

## 6. Filter, Sort & Group By
* __df[df[col] > 0.5]__ - Rows where the values in col > 0.5
* __df[(df[col] > 0.5) & (df[col] < 0.7)]__ - Rows where 0.7 > col > 0.5
* __df.sort_values(col1)__ - Sorts values by col1 in ascending order
* __df.sort_values(col2,ascending=False)__ - Sorts values by col2 in descending order
* __df.sort_values([col1,col2],ascending=[True,False])__ - Sorts values by col1 in ascending order then col2 in descending order
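
A short sketch of the filtering and sorting calls above, on a hypothetical frame `df_demo` (the column names and values are made up for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": ["x", "y", "x", "y"],
                        "col2": [0.2, 0.6, 0.65, 0.9]})

# Boolean filter: each condition needs its own parentheses when combined with &
in_range = df_demo[(df_demo["col2"] > 0.5) & (df_demo["col2"] < 0.7)]

# Single-column sort, descending
sorted_desc = df_demo.sort_values("col2", ascending=False)

# col1 ascending, then col2 descending within each col1 value
multi = df_demo.sort_values(["col1", "col2"], ascending=[True, False])

print(in_range)
print(multi)
```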

* __df.groupby(col)__ - Returns a groupby object grouped by the values of one column
* __df.groupby([col1,col2])__ - Returns a groupby object grouped by the values of multiple columns
* __df.groupby(col1)[col2].mean()__ - (Aggregation) Returns the mean of the values in col2, grouped by the values in col1
* __df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean')__ - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
* __df.apply(np.mean)__ - Applies a function to each column
* __df.apply(np.max, axis=1)__ - Applies a function to each row
* __df.applymap(lambda arg(s): expression)__ - Applies the expression to each value of the DataFrame (deprecated in favor of __df.map__ since pandas 2.1)
* __df[col].map(lambda arg(s): expression)__ - Applies the expression to each value of the column col
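
The grouping and apply calls above can be sketched on a small hypothetical frame. `df_demo` and its columns are illustrative; lambdas are used in place of `np.mean` and `np.max` only to keep the example free of extra imports.

```python
import pandas as pd

df_demo = pd.DataFrame({"species": ["a", "a", "b", "b"],
                        "length": [1.0, 3.0, 5.0, 7.0],
                        "width": [2.0, 2.0, 4.0, 6.0]})

# Mean of one column per group
mean_length = df_demo.groupby("species")["length"].mean()

# Pivot table: one row per species, mean of each listed column
pivot = df_demo.pivot_table(index="species", values=["length", "width"], aggfunc="mean")

numeric = df_demo[["length", "width"]]
col_means = numeric.apply(lambda col: col.mean())         # one result per column
row_max = numeric.apply(lambda row: row.max(), axis=1)    # one result per row

# Element-wise transformation of a single column
doubled = df_demo["length"].map(lambda x: x * 2)

print(mean_length)
print(pivot)
```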