# Pandas

### Quick Notes:

1. Pandas is a high-level data manipulation tool developed by Wes McKinney. 
2. It is built on the *NumPy* package, and its key data structure is the DataFrame.
3. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.

In [1]:
import numpy as np
import pandas as pd
import sklearn

In [2]:
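# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
from sklearn.datasets import load_boston

dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target  # append the target as an extra column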

###### 1. How to view the head or tail of the data set?

In [35]:
df.head(5)

Unnamed: 0,Gender,Age
0,Male,30
1,Female,40
2,Male,50
3,Female,60
4,Male,70


In [36]:
df.tail(5)

Unnamed: 0,Gender,Age
0,Male,30
1,Female,40
2,Male,50
3,Female,60
4,Male,70


###### 2. How to get information about your data set?

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Gender  5 non-null      object
 1   Age     5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes


###### 3. How to get the dimensions of your data set?

In [38]:
df.shape

(5, 2)

###### 4. How to fetch the total row count of a data set and store it in a variable?

In [39]:
# There are many ways to achieve this; a few options are shown below:

In [40]:
# count rows via the frame's own index, which does not depend on any column name
total_row_count = len(df.index)

In [None]:
total_row_count

In [41]:
total_row_count_v2 = len(df.axes[0])

In [42]:
total_row_count_v2

5

###### 5. How to fetch the total number of columns of a data set and store it in a variable?

In [43]:
total_column_count = len(df.axes[1])

In [44]:
total_column_count

2
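
Both counts can also be read in one step from `df.shape`. The sketch below is a minimal illustration; the frame is a hypothetical rebuild of the five-row Gender/Age data shown in the outputs above, and the variable names `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

# Hypothetical five-row frame mirroring the Gender/Age outputs shown above.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [30, 40, 50, 60, 70],
})

# df.shape is a (rows, columns) tuple, so it unpacks into two variables.
n_rows, n_cols = df.shape

# len(df) is a common shorthand for the row count,
# and len(df.columns) gives the column count.
row_count_short = len(df)
col_count_short = len(df.columns)

print(n_rows, n_cols)
```

Unpacking `shape` avoids indexing into `df.axes` and keeps the two counts together.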

###### 6. How to see statistical details of your dataset?

In [45]:
df.describe()

Unnamed: 0,Age
count,5.0
mean,50.0
std,15.811388
min,30.0
25%,40.0
50%,50.0
75%,60.0
max,70.0
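
By default `describe()` summarises only numeric columns, which is why `Gender` is absent from the output above. As a sketch, passing `include="all"` also reports count, number of unique values, most frequent value, and its frequency for non-numeric columns. The frame below is a hypothetical rebuild of the Gender/Age data.

```python
import pandas as pd

# Hypothetical five-row frame mirroring the Gender/Age outputs shown above.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [30, 40, 50, 60, 70],
})

# include="all" adds the rows 'unique', 'top' and 'freq' for non-numeric columns;
# statistics that do not apply to a column are shown as NaN.
summary = df.describe(include="all")
print(summary)
```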


###### 7. How to find missing values from a data frame?

In [46]:
# let's create a dummy data frame using below code..

d = {'Employee' : ['X', 'Y', 'Z'], 'Salary' : [1000, 2000, np.nan]}
emp = pd.DataFrame(d)
emp

Unnamed: 0,Employee,Salary
0,X,1000.0
1,Y,2000.0
2,Z,


In [47]:
# find the row where salary is missing..

emp[pd.isnull(emp["Salary"])]

Unnamed: 0,Employee,Salary
2,Z,
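
When it is not known in advance which column has gaps, a per-column count is a useful first check. The following is a minimal sketch on the same `emp` data; the names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({"Employee": ["X", "Y", "Z"], "Salary": [1000, 2000, np.nan]})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = emp.isna().sum()

# any(axis=1) keeps every row that has at least one missing value.
rows_with_missing = emp[emp.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```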


###### 8. How to replace the missing Salary in the above data set with the average salary?

In [48]:
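# fillna() returns a new Series, so assign the result back to the column.
# (Calling fillna(..., inplace=True) on a selected column can fail to update the frame in recent pandas versions.)

emp['Salary'] = emp['Salary'].fillna(emp['Salary'].mean())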

In [49]:
emp

Unnamed: 0,Employee,Salary
0,X,1000.0
1,Y,2000.0
2,Z,1500.0
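
The same replacement can be done at the frame level by passing a dictionary that maps column names to fill values. This is a sketch of one alternative, not the only approach; `filled` is an illustrative name, and the original `emp` frame is left untouched.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({"Employee": ["X", "Y", "Z"], "Salary": [1000, 2000, np.nan]})

# mean() skips NaN by default, so the average here is (1000 + 2000) / 2.
filled = emp.fillna({"Salary": emp["Salary"].mean()})

print(filled)
```

The dictionary form extends naturally to several columns, each with its own fill value.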


###### 9. How to get the distinct values of a particular column?

In [50]:
import pandas as pd
d = {'State': ['UP', 'UP', 'UP', 'UK', 'UK'], 'City': ['Lucknow', 'Kanpur', 'Agra', 'Dehradun', 'Haldwani'], 'Sale': [1000, 2000, 3000, 4000, 5000]}
df = pd.DataFrame(d)
df

Unnamed: 0,State,City,Sale
0,UP,Lucknow,1000
1,UP,Kanpur,2000
2,UP,Agra,3000
3,UK,Dehradun,4000
4,UK,Haldwani,5000


In [51]:
# get the distinct states..

df['State'].unique()

array(['UP', 'UK'], dtype=object)
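
If only the number of distinct values is needed, or how often each one occurs, `nunique()` and `value_counts()` are closely related to `unique()`. A minimal sketch on the same State/City/Sale data:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["UP", "UP", "UP", "UK", "UK"],
    "City": ["Lucknow", "Kanpur", "Agra", "Dehradun", "Haldwani"],
    "Sale": [1000, 2000, 3000, 4000, 5000],
})

# Number of distinct states.
n_states = df["State"].nunique()

# How many rows belong to each state, most frequent first.
state_counts = df["State"].value_counts()

print(n_states)
print(state_counts)
```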

###### 10. When data is skewed, a transformation can bring the data points into a closer range. How can you apply a log transformation to two numerical columns of a data set?

In [52]:
d = {"emp" : ['x', 'y', 'z'], "salary": [1000, 2000, 3000], "age" : [40, 50, 60]}

In [53]:
df = pd.DataFrame(d)
df

Unnamed: 0,emp,salary,age
0,x,1000,40
1,y,2000,50
2,z,3000,60


In [54]:
# log(x + 1) stays finite when x is 0; apply() returns a new frame and leaves df unchanged
temp = df[['salary', 'age']].apply(lambda x: np.log(x + 1))
temp

Unnamed: 0,salary,age
0,6.908755,3.713572
1,7.601402,3.931826
2,8.006701,4.110874


In [55]:
df

Unnamed: 0,emp,salary,age
0,x,1000,40
1,y,2000,50
2,z,3000,60
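
As shown above, `df` itself is unchanged because the transformed values were only stored in `temp`. The sketch below is one way to keep the result alongside the other columns, using NumPy's `np.log1p` (which computes log(1 + x)) and its inverse `np.expm1`. The names `df_log` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"emp": ["x", "y", "z"], "salary": [1000, 2000, 3000], "age": [40, 50, 60]})
cols = ["salary", "age"]

# assign() returns a new frame with the listed columns replaced; df is not modified.
df_log = df.assign(**{c: np.log1p(df[c]) for c in cols})

# expm1 undoes log1p, which is handy for mapping values back to the original scale.
restored = np.expm1(df_log[cols])

print(df_log)
```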


###### 11. Sometimes you need to normalize your data so that all features are treated equally. This matters especially for ML algorithms that use gradient descent as an optimization technique (e.g. Linear Regression, Logistic Regression, Neural Networks) and for distance-based algorithms (e.g. KNN, K-Means, SVM). How can we apply MinMaxScaler to a data set?

In [66]:
from sklearn.preprocessing import MinMaxScaler

In [67]:
scaler = MinMaxScaler()

In [68]:
d = {"emp" : ['x', 'y', 'z'], "salary": [1000, 2000, 3000], "age" : [40, 50, 60]}
df = pd.DataFrame(d)
df

Unnamed: 0,emp,salary,age
0,x,1000,40
1,y,2000,50
2,z,3000,60


In [69]:
min_max = scaler.fit_transform(df[['salary', 'age']])

In [70]:
min_max

array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])

##### Formula for calculating the min-max scaled value:

$$ X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} $$
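
The formula can be checked by hand with plain pandas arithmetic. This is a minimal sketch on the same salary/age data and should reproduce the 0, 0.5, 1 values returned by the scaler above; note that a constant column would divide by zero in this hand-written version.

```python
import pandas as pd

df = pd.DataFrame({"emp": ["x", "y", "z"], "salary": [1000, 2000, 3000], "age": [40, 50, 60]})
cols = ["salary", "age"]

# Column-wise minimum and maximum (each is a Series indexed by column name).
col_min = df[cols].min()
col_max = df[cols].max()

# pandas aligns the Series with the frame's columns, so this applies
# (X - X_min) / (X_max - X_min) to each column independently.
scaled = (df[cols] - col_min) / (col_max - col_min)

print(scaled)
```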

###### 12. ML algorithms work best when the input data is numeric. How can we create numerical features from categorical features using a Pandas data frame?

In [60]:
# let's create a test data frame for our illustration..

# named d rather than dict, to avoid shadowing Python's built-in dict type
d = {'Gender' : ['Male', 'Female', 'Male', 'Female', 'Male'], 'Age' : [30, 40, 50, 60, 70]}
df = pd.DataFrame(d)
df

Unnamed: 0,Gender,Age
0,Male,30
1,Female,40
2,Male,50
3,Female,60
4,Male,70


In [61]:
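# Let's convert the categorical column Gender to one-hot-encoded columns..
# (dtype=int gives the 0/1 output shown below; pandas 2.0 and later default to booleans)

ohe = pd.get_dummies(df['Gender'], dtype=int)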
ohe

Unnamed: 0,Female,Male
0,0,1
1,1,0
2,0,1
3,1,0
4,0,1


In [62]:
# let's concatenate it with the main data frame..

df_final = pd.concat([df,ohe], axis=1)
df_final

Unnamed: 0,Gender,Age,Female,Male
0,Male,30,0,1
1,Female,40,1,0
2,Male,50,0,1
3,Female,60,1,0
4,Male,70,0,1


In [63]:
# we can also drop the unwanted column, Gender now..

df_final2 = df_final.drop(['Gender'], axis=1)
df_final2

Unnamed: 0,Age,Female,Male
0,30,0,1
1,40,1,0
2,50,0,1
3,60,1,0
4,70,0,1


In [64]:
d = {'Gender' : ['Male', 'Female', 'Male', 'Female', 'Male'], 'Age' : [30, 40, 50, 60, 70]}
df = pd.DataFrame(d)
df

Unnamed: 0,Gender,Age
0,Male,30
1,Female,40
2,Male,50
3,Female,60
4,Male,70


In [65]:
# passing the whole frame encodes every non-numeric column and prefixes the new names
ohe = pd.get_dummies(df, dtype=int)
ohe

Unnamed: 0,Age,Gender_Female,Gender_Male
0,30,0,1
1,40,1,0
2,50,0,1
3,60,1,0
4,70,0,1
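
With two categories, `Gender_Female` and `Gender_Male` carry the same information, since one is always the complement of the other. As a sketch, `drop_first=True` keeps a single indicator column per binary feature, which some linear models prefer. The name `encoded` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [30, 40, 50, 60, 70],
})

# columns= limits encoding to the listed columns; drop_first=True drops the
# first category (here 'Female'), and dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

print(encoded)
```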


In [71]:
df

Unnamed: 0,emp,salary,age
0,x,1000,40
1,y,2000,50
2,z,3000,60


In [73]:
df[:1]

Unnamed: 0,emp,salary,age
0,x,1000,40


In [None]:
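# Assumes `data` is an existing DataFrame with a 'Class' column (it is not defined in this notebook).
data['Class'] = np.where(data['Class'] == 2, 0, 1)  # map class 2 to 0 and every other class to 1
data['Class'].value_counts()  # class distribution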
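
Because `data` is not defined anywhere above, the following sketch uses a small hypothetical frame to show what the relabelling does. The label values 2 and 4 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Class' holds two arbitrary labels, 2 and 4.
data = pd.DataFrame({"Class": [2, 4, 2, 2, 4]})

# np.where(condition, a, b) picks a where the condition holds and b elsewhere,
# so label 2 becomes 0 and every other label becomes 1.
data["Class"] = np.where(data["Class"] == 2, 0, 1)

# Count how many rows fall into each new class.
class_counts = data["Class"].value_counts()
print(class_counts)
```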
