# Data Preprocessing

**Objective:**

To perform data preprocessing techniques.

**Secondary Objectives:**

* To study Feature Scaling with Normalization and Standardization.
* To study conversion of categorical data into numeric data using various methods.
* To handle missing or irrelevant data.

Data preprocessing in Machine Learning is the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models. It is a crucial step because it improves the quality of the data, which in turn makes it easier to extract meaningful insights.

**Why do we need data preprocessing in machine learning?**

Generally, real-world data is incomplete, inconsistent, and inaccurate, and often lacks specific attributes or values. This is where data preprocessing is used: it helps to clean, format, and organize the raw data, thereby making it ready for building Machine Learning models. Preprocessing can also include handling outliers and scaling the features to a comparable range.

**Steps or techniques to perform data processing:**

* Import Libraries to be used.
* Import the dataset.
* Remove the missing or irrelevant data.
* Encode categorical data to numeric type.
* Perform feature scaling techniques (Normalization and Standardization).
* Then you can build your model with preprocessed data.




The first step is to load the libraries and the dataset. Here, I am going to use the indian_food dataset. Let's look at the details.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('../input/indian-food-101/indian_food.csv')
df.head()

In [None]:
df.shape

Here, we can see it has 255 rows and 9 columns. Most of the data is of categorical type. In the next step, we will count the number of unique ingredients used.

In [None]:
ingre = set()
for i in df['ingredients']:
    ingre.update(str(i).lower().split(","))
    
print("Total unique ingredients in dataset",len(ingre),sep=": ")

As we can see, there are 425 unique ingredients used. Now, we will count the number of ingredients used in each dish.

In [None]:
def count_ingredient(column):
    return float(len(column.split(",")))
df['ingredient_count'] = df['ingredients'].apply(count_ingredient)
df.head()
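
A note on the splitting used in the two cells above: splitting on "," alone keeps the space that follows each comma, so "milk" and " milk" would be treated as two different ingredients when counting unique values. The sketch below shows the effect on a few made-up strings (the toy rows and the helper name `split_ingredients` are assumptions for illustration, not part of the dataset or the notebook).

```python
# Toy ingredient strings, made up for illustration (not rows from the dataset).
toy_ingredients = ["Sugar, milk, ghee", "milk,Sugar", "rice flour, sugar"]


def split_ingredients(text):
    """Split a comma-separated string into clean, lower-case ingredient names."""
    return [part.strip().lower() for part in str(text).split(",") if part.strip()]


naive_unique = set()   # split only, as in the cells above
clean_unique = set()   # split, then strip surrounding whitespace
for row in toy_ingredients:
    naive_unique.update(row.lower().split(","))
    clean_unique.update(split_ingredients(row))

print("Unique (naive split):", len(naive_unique))
print("Unique (stripped):", len(clean_unique))
```

If the real column contains spaces after commas, stripping them would likely lower the unique count reported by the notebook; the per-dish count is not affected, because the number of commas stays the same.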

Now we have 10 columns. However, we will drop the ingredients column, as we already have the count of ingredients.

In [None]:
df.drop('ingredients', axis=1, inplace=True)
df.head()

In [None]:
df.describe()

In [None]:
df.describe(include='object')

Here, we can see that the minimum prep_time and cook_time is -1, which is not a realistic value. So, we will replace every -1, in both the numeric and the categorical columns, with NaN. Then we will check the number of unique values.

In [None]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
df.replace(-1, np.nan, inplace=True)
df.replace("-1", np.nan, inplace=True)
df.nunique()
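
To make the effect of this step easier to see, here is a minimal sketch on a made-up three-row frame (the values are assumptions, only the column names are borrowed from the dataset). It replaces the -1 placeholder with NaN and then counts the missing values per column, which is a useful check before deciding to drop rows.

```python
import numpy as np
import pandas as pd

# Toy data: -1 (numeric) and "-1" (text) both stand for "unknown".
toy = pd.DataFrame({
    "prep_time": [10, -1, 20],
    "cook_time": [30, 25, -1],
    "flavor_profile": ["sweet", "-1", "spicy"],
})

# Same two replacements as in the cell above, without inplace=True.
cleaned = toy.replace(-1, np.nan).replace("-1", np.nan)

# Number of missing values in each column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows that have no missing value at all.
print("Rows left after dropna():", len(cleaned.dropna()))
```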

Here, we are going to drop the rows that contain NaN values and store the result in a new dataset named 'data'.

In [None]:
data = df.dropna()

The shape, which was (255, 9), is now reduced to (180, 9).

In [None]:
data.shape

In [None]:
data.describe()

Next, I will separate the numeric data from the categorical data, so that we can easily perform different operations on each.

In [None]:
data.dtypes

Here, I am storing numerical data into "num_data".

In [None]:
num = (data.dtypes == 'float64')
numerical = list(num[num].index)
print("Numerical variables are:")
print(numerical)

In [None]:
num_data = data[numerical]
num_data.head()
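
As an aside, pandas also offers `select_dtypes()` for this kind of split. The sketch below is one possible alternative on a made-up frame (the values and the names `num_cols` and `cat_cols` are assumptions); unlike a comparison against 'float64', it would also pick up integer columns.

```python
import pandas as pd

# Toy frame with two text columns and two numeric columns.
toy = pd.DataFrame({
    "name": ["dish_a", "dish_b"],
    "prep_time": [10.0, 15.0],
    "cook_time": [30.0, 20.0],
    "course": ["dessert", "snack"],
})

# "number" matches every numeric dtype (int and float alike).
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical variables are:", num_cols)
print("Categorical variables are:", cat_cols)
```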

Next, we will perform feature scaling on this numeric data.

**What is Feature scaling?**

Feature scaling is a technique for bringing the features of the data into a common, fixed range. In our example, the prep_time and cook_time variables have different ranges: prep_time values are typically smaller than cook_time values.

![image.png](https://techondiary.files.wordpress.com/2019/02/capture.png?w=660)
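
The two scaling methods used below can be written out by hand. The following sketch assumes a tiny made-up column of values and checks that the manual formulas agree with `MinMaxScaler` and `StandardScaler` from sklearn: normalization computes (x - min) / (max - min), and standardization computes (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column with four samples.
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Normalization: rescales each column to the range [0, 1].
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: zero mean and unit variance per column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)

# Both manual results should match the sklearn scalers.
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_std, StandardScaler().fit_transform(x)))
```

Normalization is sensitive to the smallest and largest value in a column, while standardization does not bound the result to a fixed interval.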

* **Normalization using MinMaxScaler() from sklearn:**



In [None]:
scaler = MinMaxScaler()
num_data_values = num_data.values
num_data_scaled = scaler.fit_transform(num_data_values)
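# Keep the original column names and row index instead of the default 0, 1, 2
normalized_df = pd.DataFrame(num_data_scaled, columns=num_data.columns, index=num_data.index)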
normalized_df.head()

In [None]:
normalized_df.describe()

* **Standardization using StandardScaler() from sklearn:**

In [None]:
std_scaler = StandardScaler()
num_data_values = num_data.values
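num_data_std = std_scaler.fit_transform(num_data_values)
# Keep the original column names and row index instead of the default 0, 1, 2
standardized_df = pd.DataFrame(num_data_std, columns=num_data.columns, index=num_data.index)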
standardized_df.head()

In [None]:
standardized_df.describe()

Now, we will encode the categorical data into numeric data using various methods.

In [None]:
data.describe(include='object')

Here, I have separated the categorical data from the base dataset.

In [None]:
cat = (data.dtypes == 'object')
objects = list(cat[cat].index)
print("Categorical variables are:")
print(objects)

In [None]:
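# .copy() gives an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
cat_data = data[objects].copy()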
cat_data.head()

* **Label Encoder** 
 
Now, I will encode two features, 'course' and 'state', using LabelEncoder from scikit-learn.

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
cat_data['course'] = label_encoder.fit_transform(cat_data['course'])
cat_data['state'] = label_encoder.fit_transform(cat_data['state'])
cat_data.head()
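
One thing to keep in mind with the cell above: the same encoder object is fitted twice, so after the second call it only remembers the classes of 'state'. If the codes need to be turned back into labels later, one option is to keep a separate encoder per column. The sketch below shows this on made-up rows (the values and the `encoders` dictionary are assumptions for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; only the column names are borrowed from the dataset.
toy = pd.DataFrame({
    "course": ["dessert", "main course", "dessert"],
    "state": ["Punjab", "Kerala", "Punjab"],
})

# One encoder per column, stored so that each can be reused later.
encoders = {}
for col in ["course", "state"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each encoder still knows its own classes, so the codes can be decoded.
decoded_course = encoders["course"].inverse_transform(toy["course"])
print(toy)
print(list(decoded_course))
```

The integer codes follow the alphabetical order of the labels and carry no real ranking. The scikit-learn documentation describes LabelEncoder as intended for target values, so for input features `OrdinalEncoder` or dummy coding may be a better fit, depending on the model.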

* **Replace() method**

Now, using the replace() method, we will replace the flavor_profile and diet values with numerical values.

In [None]:
f_pro = {'sweet':1,'spicy':2, 'bitter':3, 'sour':4}
cat_data = cat_data.replace({'flavor_profile':f_pro})
cat_data.head()

In [None]:
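# Keys must match the spelling in the data exactly; the dataset appears to use
# 'non vegetarian' (verify with cat_data['diet'].unique())
ndiet = {'vegetarian': 0, 'non vegetarian': 1}
cat_data = cat_data.replace({'diet': ndiet})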
cat_data.head()
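
Because replace() silently leaves any value that is not in the dictionary unchanged, a misspelled key can go unnoticed. A possible safeguard, sketched below on made-up values (the category spellings and the name `diet_codes` are assumptions), is to use `Series.map()`, which returns NaN for anything without a code, and to check for such values before continuing.

```python
import pandas as pd

# Toy column; the category spellings here are assumed for illustration.
toy = pd.DataFrame({"diet": ["vegetarian", "non vegetarian", "vegetarian"]})

diet_codes = {"vegetarian": 0, "non vegetarian": 1}

# map() yields NaN for every value that has no entry in the dictionary.
encoded = toy["diet"].map(diet_codes)
unmapped = toy.loc[encoded.isna(), "diet"].unique()

# Stop early if any category was left without a code.
if len(unmapped) > 0:
    raise ValueError(f"No code defined for: {list(unmapped)}")

toy["diet"] = encoded.astype(int)
print(toy)
```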

* **get_dummies()**

Here, we will use the get_dummies() method. cat_data is the dataframe, and we pass 'region' in the columns argument to specify which column should be dummy coded. The categorical values in region are recoded into a set of separate binary variables (dummy variables). So what is a dummy variable? Typically, a dummy variable (or column) is one which has a value of one (1) when a categorical event occurs and zero (0) when it doesn't occur. This re-coding is called “dummy coding” and involves the creation of a table called a contrast matrix.
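
As a small illustration of dummy coding, the sketch below applies get_dummies() to four made-up region values (the toy frame is an assumption; `dtype=int` is passed only so that the output shows 0 and 1 rather than booleans).

```python
import pandas as pd

# Toy column with three distinct categories.
toy = pd.DataFrame({"region": ["North", "South", "North", "East"]})

# Each distinct value becomes its own 0/1 column named "<prefix>_<value>".
dummies = pd.get_dummies(toy, columns=["region"], prefix=["cat"], dtype=int)
print(dummies)
```

Every row has exactly one 1 across the new columns, marking the category that row belongs to.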

In [None]:
cat_data = pd.get_dummies(cat_data,columns=['region'],prefix = ['cat'])
cat_data.head()

In [None]:
cat_data.describe()

**Conclusion:**

Thus, we handled the missing data, scaled the numeric data using normalization and standardization, and converted the categorical data to numeric data for ease of building models.