# Pandas

## Quick Notes:

1. Pandas is a high-level data manipulation library originally developed by Wes McKinney.
2. It is built on the *NumPy* package, and its key data structure is the DataFrame.
3. DataFrames let you store and manipulate tabular data in rows of observations and columns of variables.
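
As a minimal sketch of these notes, the snippet below builds a small DataFrame from a dictionary. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: each key becomes a column (variable),
# each position in the lists becomes a row (observation).
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "population": [120, 340, 560],
})

print(toy)

# A numeric column can be handed back to NumPy as an array.
population_array = toy["population"].to_numpy()
print(type(population_array))
```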

In [1]:
import numpy as np
import pandas as pd
import sklearn

In [2]:
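# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston

dataset = load_boston()
# Build a DataFrame from the feature matrix, then append the target as a column.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target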
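
Where `load_boston` is unavailable, the same array-plus-column-names pattern can be tried on any NumPy array. The sketch below uses randomly generated stand-in data with hypothetical column names. It is not the Boston data and will not reproduce the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = rng.random((6, 3))        # stand-in data: 6 observations, 3 variables
feature_names = ["f1", "f2", "f3"]   # hypothetical column names
target = rng.random(6)               # stand-in target values

demo = pd.DataFrame(features, columns=feature_names)
demo["target"] = target              # append the target as a new column
print(demo.shape)
```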

## 1. How to view the head or tail of the data set?

In [3]:
df.head(5)

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [4]:
df.tail(5)

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.12,76.7,2.2875,1.0,273.0,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0.0,0.573,6.03,80.8,2.505,1.0,273.0,21.0,396.9,7.88,11.9
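
Both methods take an optional row count `n`. The sketch below, on a small made-up frame, shows the default and a negative `n`.

```python
import pandas as pd

demo = pd.DataFrame({"x": range(10)})  # hypothetical 10-row frame

first_three = demo.head(3)        # rows 0, 1, 2
last_two = demo.tail(2)           # rows 8, 9
default_head = demo.head()        # n defaults to 5
all_but_last_two = demo.head(-2)  # negative n: everything except the last 2 rows

print(first_three.index.tolist(), last_two.index.tolist())
```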


## 2. How to get information about your data set?

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  target   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
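
`info()` prints its summary rather than returning it. If the individual pieces are needed as values, something like the following sketch may help. The frame here is a made-up example with one missing entry.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

dtypes = demo.dtypes                            # dtype of each column
non_null = demo.count()                         # non-null count per column
mem_bytes = demo.memory_usage(deep=True).sum()  # total bytes across index and columns

print(non_null)
```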


## 3. How to get the dimensions of your data set?

In [6]:
df.shape

(506, 14)

## 4. How to fetch the total row count of a data set and store it in a variable?

In [7]:
# There are several ways to do this; a few options are shown below.

In [8]:
total_row_count = len(df['target'].index)

In [9]:
total_row_count

506

In [10]:
total_row_count_v2 = len(df.axes[0])

In [11]:
total_row_count_v2

506
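
A few other equivalent ways to get the row count, sketched on a small made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

rows_a = len(demo)        # len() of a DataFrame is its row count
rows_b = demo.shape[0]    # first element of the (rows, columns) tuple
rows_c = len(demo.index)  # length of the row index

print(rows_a, rows_b, rows_c)
```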

## 5. How to fetch the total number of columns of a data set and store it in a variable?

In [12]:
total_column_count = len(df.axes[1])

In [13]:
total_column_count

14
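
The column count can also be read from `columns` or `shape`. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

cols_a = len(demo.columns)  # length of the column index
cols_b = demo.shape[1]      # second element of the (rows, columns) tuple
n_rows, n_cols = demo.shape # unpack both dimensions at once

print(cols_a, cols_b, n_rows, n_cols)
```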

## 6. How to see statistical details of your dataset?

In [14]:
df.describe()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0
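
The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by label. The sketch below uses made-up data and also passes custom percentiles.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles.
stats = demo.describe(percentiles=[0.1, 0.5, 0.9])

mean_a = stats.loc["mean", "a"]  # a single statistic for a single column
max_b = stats.loc["max", "b"]

print(stats)
```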


## 7. How to find missing values in a data frame?

In [29]:
# Create a small example data frame with one missing salary.

d = {'Employee' : ['X', 'Y', 'Z'], 'Salary' : [1000, 2000, np.nan]}
emp = pd.DataFrame(d)
emp

,Employee,Salary
0,X,1000.0
1,Y,2000.0
2,Z,NaN


In [27]:
# Select the rows where Salary is missing.

emp[pd.isnull(emp["Salary"])]

,Employee,Salary
2,Z,NaN
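
To check every column at once rather than a single one, `isna()` can be combined with `sum()` or `any()`. A minimal sketch that rebuilds the same example frame:

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({"Employee": ["X", "Y", "Z"], "Salary": [1000, 2000, np.nan]})

missing_per_column = emp.isna().sum()            # number of missing values in each column
rows_with_missing = emp[emp.isna().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(rows_with_missing)
```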


## 8. How to replace the missing Salary in the above data set with the average salary?

In [37]:
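# fillna() returns a new Series, so assign the result back to the column.
# Calling it with inplace=True on a selected column may leave emp unchanged in newer pandas versions.

emp['Salary'] = emp['Salary'].fillna(emp['Salary'].mean())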

In [38]:
emp

,Employee,Salary
0,X,1000.0
1,Y,2000.0
2,Z,1500.0
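
An alternative sketch, assuming the original frame should stay untouched: `DataFrame.fillna` accepts a dictionary that maps column names to fill values and returns a new frame.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({"Employee": ["X", "Y", "Z"], "Salary": [1000, 2000, np.nan]})

# mean() skips NaN by default, so this is the mean of 1000 and 2000.
average_salary = emp["Salary"].mean()

# Fill only the Salary column; emp itself is not modified.
filled = emp.fillna({"Salary": average_salary})

print(filled)
```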
