<a href="https://colab.research.google.com/github/sulav063/Machine_Learning_Workshop/blob/main/Panda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas 🐼🐼🐼🐼

📚 Data Preprocessing in Machine Learning using Pandas

🔗 Data preprocessing is the set of tasks required to clean the data and make it suitable for a machine learning model. Done well, it can also improve the model's accuracy and efficiency.

🎯 Pandas is an open-source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of NumPy, which provides support for multi-dimensional arrays. Preprocessing is a preliminary pass over the data that transforms it into a standard, normalized format before modeling.

💡 Preprocessing involves the following aspects:

*   missing values
*   data standardization
*   data normalization
*   data binning
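
As a rough illustration of the four aspects listed above, the following sketch applies each one to a single column. The `age` column, its values, and the bin edges and labels are hypothetical and are not part of the workshop dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from the workshop dataset.
df = pd.DataFrame({"age": [20.0, None, 40.0, 60.0]})

# Missing values: fill the gap with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Normalization (min-max): rescale the values to the range [0, 1].
age_min, age_max = df["age"].min(), df["age"].max()
df["age_norm"] = (df["age"] - age_min) / (age_max - age_min)

# Binning: group continuous values into labelled intervals.
# The edges and labels below are assumptions chosen for the example.
df["age_bin"] = pd.cut(
    df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)

print(df)
```

Which technique fits depends on the model: scale-sensitive models usually benefit from standardization or normalization, while binning trades precision for simpler categories.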

## Day 3

In [None]:
import pandas as pd

In [None]:
data = {
    "cars": ["BMW", "Volvo", "Ford"],
    "Models": ["2023", "2009", "2012"],
    "passenger": ["4", "5", "2"]
}
data

{'cars': ['BMW', 'Volvo', 'Ford'],
 'Models': ['2023', '2009', '2012'],
 'passenger': ['4', '5', '2']}

In [None]:
# Convert the data dictionary to a DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,cars,Models,passenger
0,BMW,2023,4
1,Volvo,2009,5
2,Ford,2012,2


In [None]:
df.to_csv("/content/drive/MyDrive/Colab Notebooks/Pandas/cars.csv")
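
One detail worth knowing about `to_csv`: by default it also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False`. It writes to a temporary directory (an assumption made so it runs without Google Drive mounted), and the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"cars": ["BMW", "Volvo", "Ford"], "passenger": [4, 5, 2]})

# Hypothetical temporary location, used so the sketch runs without Google Drive.
tmp_dir = tempfile.mkdtemp()
with_index = os.path.join(tmp_dir, "with_index.csv")
without_index = os.path.join(tmp_dir, "without_index.csv")

df.to_csv(with_index)                  # the index is written as an unnamed first column
df.to_csv(without_index, index=False)  # only the data columns are written

cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` is usually the simpler choice when the index is just the default 0, 1, 2 numbering.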

In [None]:
# create a dataframe with two features calories and duration
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
})
df

Unnamed: 0,calories,duration
0,420,50
1,380,40
2,390,45


In [None]:
df.to_csv("/content/drive/MyDrive/Colab Notebooks/Pandas/calories.csv")

## Day 4

In [None]:
# Create a DataFrame with a custom index
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df

Unnamed: 0,calories,duration
day1,420,50
day2,380,40
day3,390,45


In [None]:
print(df.loc[["day1","day2"]])

      calories  duration
day1       420        50
day2       380        40
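
For comparison, here is a small sketch of the other common lookups on the same kind of labelled DataFrame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

row_as_series = df.loc["day1"]             # single label: returns a Series
rows_as_frame = df.loc[["day1", "day2"]]   # list of labels: returns a DataFrame
first_row = df.iloc[0]                     # position-based lookup, ignores labels
single_value = df.loc["day2", "calories"]  # row label plus column label

print(type(row_as_series).__name__, type(rows_as_frame).__name__)
print(single_value)
```

`loc` selects by label and `iloc` by integer position, so `df.loc["day1"]` and `df.iloc[0]` return the same row here.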


### Data cleaning with Pandas
Bad data can take several forms:

*   Empty cells
*   Data in the wrong format
*   Wrong data
*   Duplicates
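
Each kind of bad data in the list above can be detected with a short check. The sketch below uses a hypothetical four-row sample (not the workshop CSV), and the 120-minute threshold for "wrong data" is an assumed domain rule chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample containing one problem of each kind.
df = pd.DataFrame({
    "Duration": [60, 60, 450, 45],
    "Date": ["2020/12/01", "2020/12/01", "2020/12/03", "20201204"],
    "Calories": [409.1, 409.1, None, 282.4],
})

# Empty cells: count missing values per column.
empty_cells = df.isna().sum()

# Duplicates: rows that are identical to an earlier row.
duplicate_rows = df.duplicated().sum()

# Wrong format: dates that do not match the expected pattern become NaT.
parsed_dates = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")
bad_dates = parsed_dates.isna().sum()

# Wrong data: needs a domain rule; a workout longer than 120 minutes
# is assumed to be an entry error here.
suspicious_durations = (df["Duration"] > 120).sum()

print(empty_cells)
print(duplicate_rows, bad_dates, suspicious_durations)
```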



In [None]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Pandas/pandas_dataset.csv")

df.head(5)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0


In [None]:
df["Date"] = df["Date"].str.replace('\'','',regex=False)
df.head(5)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
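
After the stray quote is removed, the `Date` column still holds plain strings. A possible next step, sketched here on hypothetical sample values, is to convert it to a real datetime type with `pd.to_datetime`. The `format` string assumes the `YYYY/MM/DD` layout shown in the output above.

```python
import pandas as pd

# Hypothetical sample mirroring the Date column, including the stray quote.
df = pd.DataFrame({"Date": ["2020/12/01'", "2020/12/02'", "not a date"]})

# Remove the trailing quote, as in the cell above.
df["Date"] = df["Date"].str.replace("'", "", regex=False)

# Values that do not match the format become NaT instead of raising an error.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

print(df.dtypes)
print(df)
```

With `errors="coerce"`, unparseable entries show up as missing values, so they can be handled with the same `dropna` and `fillna` tools used for empty cells.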


In [None]:
# Return a new DataFrame with no empty cells.
# df.dropna() drops rows that contain empty cells but leaves df itself unchanged;
# to change df, assign the result back (df = df.dropna()) or pass inplace=True.
new_df = df.dropna()
new_df


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
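
`dropna` can also be restricted to particular columns. The following sketch, on a hypothetical three-row sample, compares dropping rows with any missing value against dropping only rows where `Calories` is missing.

```python
import pandas as pd

# Hypothetical sample with one gap in each column.
df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, 479.0, None],
})

all_complete = df.dropna()                       # drop rows with any missing value
calories_known = df.dropna(subset=["Calories"])  # only require Calories to be present

print(len(df), len(all_complete), len(calories_known))
```

Using `subset` keeps more rows when only some columns matter for the analysis, and `df` itself is left untouched in both cases.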


In [None]:
# Another way to handle null values is to replace them with a fixed value
df_new2 = df.fillna(130)
df_new2

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
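
Note that `df.fillna(130)` applies the same value to every column, including non-numeric ones such as `Date`. One way to limit the replacement, sketched here on a hypothetical two-row sample, is to pass a dictionary that maps column names to fill values.

```python
import pandas as pd

# Hypothetical sample with a missing value in each column.
df = pd.DataFrame({
    "Date": ["2020/12/01", None],
    "Calories": [409.1, None],
})

# Only the columns named in the dictionary are filled.
filled = df.fillna({"Calories": 130})

print(filled)
```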


In [None]:
# Inspect df before filling the null values of a specific column
df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [None]:
# Assign the result back to the column. Calling fillna(..., inplace=True) on
# df['Calories'] operates on an intermediate copy: recent pandas versions warn
# about this pattern, and it stops working in pandas 3.0.
df['Calories'] = df['Calories'].fillna(150)
df


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [None]:
# Fill null values with the column mean (median or mode can be used the same way)
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Pandas/pandas_dataset.csv")
x = df["Calories"].mean()
# Assign the result back instead of using inplace=True on the selected column,
# which triggers a pandas warning and stops working in pandas 3.0.
df["Calories"] = df["Calories"].fillna(x)
df


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020/12/01',110,130,409.1
1,60,2020/12/02',117,145,479.0
2,60,2020/12/03',103,135,340.0
3,45,2020/12/04',109,175,282.4
4,45,2020/12/05',117,148,406.0
5,60,2020/12/06',102,127,300.0
6,60,2020/12/07',110,136,374.0
7,450,2020/12/08',104,134,253.3
8,30,2020/12/09',109,133,195.1
9,60,2020/12/10',98,124,269.0
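
The cell above uses the mean. As a sketch of the other two options, the example below computes the median and mode on a hypothetical `Calories` series with one outlier, which shows why the choice can matter.

```python
import pandas as pd

# Hypothetical values: one missing entry and one outlier (1000.0).
calories = pd.Series([300.0, 300.0, 340.0, None, 1000.0], name="Calories")

mean_value = calories.mean()      # pulled upward by the outlier
median_value = calories.median()  # middle value, robust to the outlier
mode_value = calories.mode()[0]   # mode() returns a Series; take the first entry

filled_with_median = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled_with_median.tolist())
```

When a column contains extreme values, the median is often a safer fill value than the mean, while the mode is most useful for categorical or heavily repeated values.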
