
#**Pandas**

- Pandas is an open-source library built on top of NumPy. It adds higher-level data structures for labeled data to the array features NumPy provides.
- Allows Data Cleaning, fast analysis, and Data Preparation.
- It also enhances performance and productivity.
- It has some built-in visualisation features.

Installation: **pip install pandas**



####Topics

- Series
- Dataframe
- Missing Data
- GroupBy
- Merging, Joining, Concatenating
- Operations
- Data input & output

###**Series**

In [0]:
import numpy as np
import pandas as pd

In [0]:
#Create various series from various object types.

data_1 = [50, 60, 70]
labels = ['a', 'b', 'c']
arr = np.array(data_1)
dic = {'a': 10, 'b': 20, 'c': 30}

In [0]:
pd.Series(data=data_1)  # Values of data_1 with the default integer index (0, 1, 2).

0    50
1    60
2    70
dtype: int64

In [0]:
pd.Series(data=data_1, index=labels)  # Use the 'labels' list as the index.

a    50
b    60
c    70
dtype: int64

In [0]:
pd.Series(data_1, labels)  # The data= and index= keywords are optional; positional arguments work too.

a    50
b    60
c    70
dtype: int64

In [0]:
# A NumPy array can be passed as the data.

pd.Series(data=arr)

0    50
1    60
2    70
dtype: int64

In [0]:
pd.Series(arr, labels)  # Positional arguments work here as well.

a    50
b    60
c    70
dtype: int64

In [0]:
# Passing a dictionary to pandas.

pd.Series(dic)  # Dictionary keys become the index and the values become the data.

a    10
b    20
c    30
dtype: int64

A Series can hold any type of Python object, including built-in functions. In that case the dtype is object.

In [0]:
pd.Series(data = [print, len, sum])

0    <built-in function print>
1      <built-in function len>
2      <built-in function sum>
dtype: object

In [0]:
series1 = pd.Series([1, 2, 3, 4], ['Pizza', 'chicken', 'Rice', 'Ghee'])
print(series1)  # Here the index labels are strings.

Pizza      1
chicken    2
Rice       3
Ghee       4
dtype: int64


In [0]:
series2 = pd.Series([2, 1, 4, 3], ['Rice', 'Pizza', 'Mutton', 'Ghee'])
print(series2)

Rice      2
Pizza     1
Mutton    4
Ghee      3
dtype: int64


In [0]:
# Look up a value by its index label.

series1['Rice']

3

In [0]:
series2['Ghee']

3

In [0]:
series1 + series2  # Adds values whose labels appear in both Series; a label found in only one Series gives NaN.

Ghee       7.0
Mutton     NaN
Pizza      2.0
Rice       5.0
chicken    NaN
dtype: float64
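
If the NaN values are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one Series as the given value. This is only a small sketch that rebuilds the two Series from the cells above; the variable name `total` is chosen here for illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], ['Pizza', 'chicken', 'Rice', 'Ghee'])
series2 = pd.Series([2, 1, 4, 3], ['Rice', 'Pizza', 'Mutton', 'Ghee'])

# A label missing from one Series is treated as 0 instead of producing NaN.
total = series1.add(series2, fill_value=0)
print(total)
```

With this, 'Mutton' and 'chicken' keep the value they have in the one Series that contains them.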

###**Dataframes**

In [0]:
from numpy.random import randn

In [0]:
np.random.seed(101)  # Seeding the generator makes it produce the same random numbers on every run.
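
A quick way to convince ourselves of the effect of the seed is to draw numbers twice with the same seed and compare them. The names `first` and `second` below are just illustrative.

```python
import numpy as np

np.random.seed(101)
first = np.random.randn(3)

np.random.seed(101)  # Re-seed with the same value.
second = np.random.randn(3)

# The same seed gives the same sequence of random numbers.
print(np.array_equal(first, second))
```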

- Creating a DataFrame

In [0]:
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'],['W', 'X', 'Y', 'Z'])

In [0]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
type(df)

pandas.core.frame.DataFrame

- Indexing & Selection

In [0]:
df['W']  #Gets the W column.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [0]:
type(df['W'])

pandas.core.series.Series

In [0]:
#Get multiple columns.
df[['W', 'Z']]

          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509


In [0]:
# Create a new column.

df['New_Col'] = df['W'] + df['Z']  # Here the new column is built from two existing columns.
df

          W         X         Y         Z   New_Col
A  2.706850  0.628133  0.907969  0.503826  3.210676
B  0.651118 -0.319318 -0.848077  0.605965  1.257083
C -2.018168  0.740122  0.528813 -0.589001 -2.607169
D  0.188695 -0.758872 -0.933237  0.955057  1.143752
E  0.190794  1.978757  2.605967  0.683509  0.874303


In [0]:
#Delete a column.

df.drop('New_Col', axis=1)  # The axis parameter selects rows or columns: axis=0 means rows, axis=1 means columns.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
df

          W         X         Y         Z   New_Col
A  2.706850  0.628133  0.907969  0.503826  3.210676
B  0.651118 -0.319318 -0.848077  0.605965  1.257083
C -2.018168  0.740122  0.528813 -0.589001 -2.607169
D  0.188695 -0.758872 -0.933237  0.955057  1.143752
E  0.190794  1.978757  2.605967  0.683509  0.874303


- The **drop** method does not modify the original DataFrame; it returns a new one, so data is not lost by accident while adjusting a dataset. To drop a column from the original DataFrame itself, pass the **inplace** parameter set to True.

In [0]:
df.drop('New_Col', axis=1, inplace=True)

In [0]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
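
An alternative to inplace=True is to assign the result of drop back to a variable. The sketch below uses a small made-up DataFrame (`frame`, with invented values) rather than the random one above, so that it runs on its own.

```python
import pandas as pd

# Small illustrative DataFrame; the values are made up.
frame = pd.DataFrame(
    {'W': [1.0, 2.0], 'X': [3.0, 4.0], 'New_Col': [4.0, 6.0]},
    index=['A', 'B'],
)

trimmed = frame.drop(columns='New_Col')  # Returns a new DataFrame.
print(list(frame.columns))    # The original still has New_Col.
print(list(trimmed.columns))  # The copy does not.

# Reassigning has the same end result as inplace=True.
frame = frame.drop(columns='New_Col')
```

Here `columns='New_Col'` is another way of writing `'New_Col', axis=1`.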


In [0]:
#Dropping a row.

df.drop('E')  # axis=0 (rows) is the default, so it does not need to be passed.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057


In [0]:
df  # inplace=True was not used, so row E is still in the original DataFrame.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
df.shape #Get rows and columns count.

(5, 4)

In [0]:
#Get rows from a dataframe.

df.loc['A']  # A single row is returned as a Series, just like a single column.

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [0]:
#Get multiple rows from a dataframe.

df.loc[['A', 'B']]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965


In [0]:
# Get a row from a DataFrame by its integer position.

df.iloc[2]  # Gets row C, because row C is at position 2 (counting from 0).

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [0]:
#Get single cell from a dataframe.
df.loc['C', 'Z'] 

-0.5890005332865824

In [0]:
#Get multiple cells at a time.

print(df)
print('\n')
df.loc[['A', 'D'], ['X', 'Z']]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509




          X         Z
A  0.628133  0.503826
D -0.758872  0.955057
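
To summarise the selection cells above: loc works with labels and iloc works with integer positions. The following sketch puts the two side by side on a small DataFrame whose values are rounded from the one above; the variable names are chosen here for illustration.

```python
import pandas as pd

# Values rounded from the DataFrame above, for illustration only.
frame = pd.DataFrame(
    {'W': [2.71, 0.65, -2.02], 'X': [0.63, -0.32, 0.74]},
    index=['A', 'B', 'C'],
)

by_label = frame.loc['C', 'X']     # Row label 'C', column label 'X'.
by_position = frame.iloc[2, 1]     # Third row, second column.
print(by_label == by_position)

label_slice = frame.loc['A':'B']   # Label slices include the end label.
position_slice = frame.iloc[0:2]   # Position slices exclude the stop position.
print(label_slice.equals(position_slice))
```

Note the difference in slicing: 'A':'B' includes row B, while 0:2 stops before position 2.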


##**Dataframes Part 2**

In [0]:
np.random.seed(101)

In [0]:
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'],['W', 'X', 'Y', 'Z'])

In [0]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


We can perform **Conditional Selection**, which is one of the important features of the pandas library: a comparison on a DataFrame returns True/False values that can then be used to filter it.

In [0]:
df > 0

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


In [0]:
bool_df = df > 0
bool_df

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


In [0]:
df[bool_df]  # Keeps the original value where the condition is True and shows NaN where it is False.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
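
The same masking can also be written with the `where` method, which additionally lets us choose a replacement value instead of NaN. This is a small sketch on made-up values; `frame`, `masked`, `same` and `filled` are names used only here.

```python
import pandas as pd

# Small illustrative DataFrame; the values are made up.
frame = pd.DataFrame({'W': [2.7, -2.0], 'X': [-0.3, 0.7]}, index=['A', 'C'])

masked = frame[frame > 0]             # NaN where the condition is False.
same = frame.where(frame > 0)         # Equivalent spelling with where().
filled = frame.where(frame > 0, 0.0)  # Use 0.0 instead of NaN.

print(masked.equals(same))
print(filled)
```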


In [0]:
# A comparison on a single column gives a Boolean Series with one value per row.
df['W'] > 0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [0]:
df['W']  # Compare this output with the previous cell's output.
# The only negative value (row C) is the one that gave False above.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [0]:
# Passing that Boolean Series to the DataFrame filters the rows based on the column value.

df[df['W'] > 0]  # Only the rows where column W is > 0 are returned, so row C is excluded.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
# Grab all the rows in this DataFrame where Z is > 0.

result_df = df[df['Z'] > 0]
result_df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [0]:
# Grab columns from result_df.

result_df[['Y', 'W']]

          Y         W
A  0.907969  2.706850
B -0.848077  0.651118
D -0.933237  0.188695
E  2.605967  0.190794


In the two cells above we first grabbed some rows based on a condition and then grabbed one or more columns from the result. This entire process can also be written in a single line, as shown below.

In [0]:
# Grab all the rows where 'W' > 0 and, from those rows, grab columns 'W' and 'Y'.

df[df['W'] > 0][['W', 'Y']]

          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
D  0.188695 -0.933237
E  0.190794  2.605967
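
The chained selection above can also be written as a single `loc` call that takes the row condition and the column list together. The sketch below compares both forms on a small DataFrame with values rounded from the one above; `chained` and `single` are illustrative names.

```python
import pandas as pd

# Values rounded from the DataFrame above, for illustration only.
frame = pd.DataFrame(
    {'W': [2.71, 0.65, -2.02], 'X': [0.63, -0.32, 0.74], 'Y': [0.91, -0.85, 0.53]},
    index=['A', 'B', 'C'],
)

chained = frame[frame['W'] > 0][['W', 'Y']]     # Two selection steps.
single = frame.loc[frame['W'] > 0, ['W', 'Y']]  # Rows and columns in one step.

print(chained.equals(single))
```

The `loc` form is usually the safer choice when the selected values will be assigned to later, because it avoids working on an intermediate copy.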


We can also use multiple conditions in pandas. To combine conditions we need either an AND or an OR operator. In pandas we use & for AND and | for OR, and each condition must be wrapped in parentheses. Check out the example below.

NB: '|' is called the pipe character. On many keyboards it is on the key just above the Enter key.

In [103]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [104]:
df[(df['W'] > 0) & (df['X'] > 1)]  # The & operator applies AND row by row.

          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509


In [106]:
df[(df['X'] > 0) | (df['Y'] > 1)]  # The | operator applies OR row by row.

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
C -2.018168  0.740122  0.528813 -0.589001
E  0.190794  1.978757  2.605967  0.683509
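
It may help to see why Python's own `and` keyword cannot be used here, and that a condition can be negated with ~. The sketch below uses values rounded from the DataFrame above; the names `both`, `not_positive_w` and `raised` are only for illustration.

```python
import pandas as pd

# Values rounded from the DataFrame above, for illustration only.
frame = pd.DataFrame(
    {'W': [2.71, 0.65, -2.02, 0.19, 0.19],
     'X': [0.63, -0.32, 0.74, -0.76, 1.98]},
    index=['A', 'B', 'C', 'D', 'E'],
)

both = frame[(frame['W'] > 0) & (frame['X'] > 1)]  # Row-by-row AND.
not_positive_w = frame[~(frame['W'] > 0)]          # ~ negates a condition.

# Python's 'and' tries to reduce a whole Series to one True/False value,
# which pandas refuses to do, so it raises a ValueError.
try:
    frame[(frame['W'] > 0) and (frame['X'] > 1)]
    raised = False
except ValueError:
    raised = True

print(list(both.index), list(not_positive_w.index), raised)
```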


###**More about Index**

In [107]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [109]:
# Reset the index.
df.reset_index()  # The old index becomes a column named 'index' and the rows get a default integer index.

  index         W         X         Y         Z
0     A  2.706850  0.628133  0.907969  0.503826
1     B  0.651118 -0.319318 -0.848077  0.605965
2     C -2.018168  0.740122  0.528813 -0.589001
3     D  0.188695 -0.758872 -0.933237  0.955057
4     E  0.190794  1.978757  2.605967  0.683509
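
Like drop, reset_index returns a new DataFrame and leaves the original untouched. The sketch below also shows the reverse operation, set_index, and the drop=True option that discards the old index. The DataFrame and variable names are made up for illustration.

```python
import pandas as pd

# Small illustrative DataFrame; the values are made up.
frame = pd.DataFrame({'W': [2.71, 0.65]}, index=['A', 'B'])

flat = frame.reset_index()              # Old index moves into an 'index' column.
restored = flat.set_index('index')      # Turn that column back into the index.
dropped = frame.reset_index(drop=True)  # Discard the old index entirely.

print(list(flat.columns))
print(list(restored.index))
print(list(dropped.columns))
```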


We can add a column to our DataFrame from a list (here the list is built by splitting a string). Note that this only works if the list has as many items as the DataFrame has rows. Check the example below.

In [0]:
col_1 = "GH JK MN KO LM".split()  #Creating a list from a string.

In [118]:
col_1  # A list from string.

['GH', 'JK', 'MN', 'KO', 'LM']

In [119]:
df['new_column1'] = col_1 # Create a new column in our dataframe.
df

          W         X         Y         Z new_column1
A  2.706850  0.628133  0.907969  0.503826          GH
B  0.651118 -0.319318 -0.848077  0.605965          JK
C -2.018168  0.740122  0.528813 -0.589001          MN
D  0.188695 -0.758872 -0.933237  0.955057          KO
E  0.190794  1.978757  2.605967  0.683509          LM
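
To see what happens when the number of items does not match the number of rows, here is a small sketch on a made-up three-row DataFrame. pandas raises a ValueError in that case; the names `frame` and `message` are chosen here for illustration.

```python
import pandas as pd

# Small illustrative DataFrame with three rows; the values are made up.
frame = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=['A', 'B', 'C'])

frame['ok'] = "GH JK MN".split()  # Three items for three rows: works.

try:
    frame['bad'] = ['GH', 'JK']   # Two items for three rows: fails.
    message = None
except ValueError as err:
    message = str(err)

print(message)
print(list(frame.columns))
```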
