# Module 6: Feature Engineering I - Categorical Variables and Missing Values

## Introduction
In this module, we focus on understanding and manipulating categorical variables, which appear in most real-world datasets. A categorical variable takes one of a limited set of values (categories), which may or may not have a logical order. We also explore strategies for handling missing data, which is crucial for keeping an analysis reliable.

## Objectives
- **Understanding Categorical Variables**: Learn what categorical variables are, how they differ from numerical variables, and why they are important in data science.
- **Handling Missing Data**: Understand the impact of missing data on analyses and learn multiple strategies to impute or handle missing values effectively.

### 1. Categorical Variables
Categorical variables assign each observation to one of a fixed set of groups. They can be:
- **Nominal**: No natural ordering (e.g., colors, types of cars).
- **Ordinal**: Natural ordering exists (e.g., class levels, ratings).
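
The distinction between nominal and ordinal data can be made explicit in pandas. The following is a minimal sketch on made-up values; treating `owner` as ordinal and the three levels listed here are assumptions for illustration, not a statement about how any particular dataset should be modeled.

```python
import pandas as pd

# Nominal: categories with no inherent order.
fuel = pd.Series(["Diesel", "Petrol", "Diesel", "CNG"], dtype="category")

# Ordinal: categories with an explicit order, declared from lowest to highest.
# The levels below are illustrative assumptions.
owner_levels = ["First Owner", "Second Owner", "Third Owner"]
owner = pd.Series(
    pd.Categorical(
        ["Second Owner", "First Owner", "Third Owner", "First Owner"],
        categories=owner_levels,
        ordered=True,
    )
)

print(fuel.cat.ordered)   # False
print(owner.cat.ordered)  # True

# Comparisons and min/max are only defined for ordered categoricals.
print(owner.min())
print((owner > "First Owner").tolist())
```

Declaring the order once on the column means later sorting and comparisons respect it, instead of falling back to alphabetical order.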

#### Visualization Techniques:
- Bar charts for nominal data.
- Bar charts or line graphs for ordinal data, with the categories plotted in their natural order.
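
As a rough sketch of both chart types, the example below plots a small made-up sample. The column names `fuel` and `owner` and the category order are assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small illustrative sample (made-up values).
sample = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel", "Petrol", "Petrol", "CNG"],
    "owner": ["First Owner", "Second Owner", "First Owner",
              "Third Owner", "First Owner", "Second Owner"],
})

# Nominal: sort bars by frequency, since the categories have no order of their own.
fuel_counts = sample["fuel"].value_counts()

# Ordinal: keep the natural order of the categories rather than sorting by count.
owner_order = ["First Owner", "Second Owner", "Third Owner"]  # assumed order
owner_counts = sample["owner"].value_counts().reindex(owner_order)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fuel_counts.plot(kind="bar", ax=axes[0], title="fuel (nominal)")
owner_counts.plot(kind="bar", ax=axes[1], title="owner (ordinal)")
fig.tight_layout()
```

In a notebook the figure renders inline; in a plain script, a call to `plt.show()` would be needed.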

#### Manipulation Techniques:
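- **One-hot encoding**: Create one binary (0/1) column per category, so that algorithms requiring numerical input can use the variable without an artificial order being imposed. Best suited to nominal data.
- **Label encoding**: Map each category to a unique integer. Because integers imply an order, this suits ordinal variables better than nominal ones.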
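
Both encodings can be sketched with pandas alone. The sample values and the `owner_rank` mapping below are hypothetical; an explicit mapping is used for label encoding so that the integers follow the intended order rather than alphabetical order.

```python
import pandas as pd

# Made-up sample for illustration.
sample = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel", "CNG"],
    "owner": ["Second Owner", "First Owner", "Third Owner", "First Owner"],
})

# One-hot encoding: one 0/1 column per fuel category.
fuel_onehot = pd.get_dummies(sample["fuel"], prefix="fuel", dtype=int)

# Label encoding with an explicit (assumed) mapping that preserves the order.
owner_rank = {"First Owner": 0, "Second Owner": 1, "Third Owner": 2}
owner_encoded = sample["owner"].map(owner_rank)

encoded = pd.concat([fuel_onehot, owner_encoded.rename("owner_encoded")], axis=1)
print(encoded)
```

Each row of `fuel_onehot` contains exactly one 1, which is the defining property of a one-hot encoding.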

### 2. Handling Missing Values
Missing data can significantly impact the quality of predictions. Techniques to handle missing values include:
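- **Deletion**: Remove rows with missing values. This works well when only a small fraction of rows is affected and the values are missing at random; otherwise it can discard useful data or bias the sample.
- **Imputation**:
  - **Mean/Median/Mode Imputation**: Replace missing values with the column's mean or median (numerical features) or its mode (categorical features).
  - **Prediction Models**: Use a machine learning model to predict and fill in missing values based on other available data.
  - **K-Nearest Neighbors (KNN)**: Fill each missing value from the K most similar rows, with similarity measured on the features that are present.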
  - **Iterative Imputation**: Model each feature with missing values as a function of other features in a round-robin fashion.
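
The strategies above can be compared side by side on a toy table. This is a sketch using scikit-learn's imputers; the column names `engine_cc` and `max_power_bhp`, the values, and the choice of `n_neighbors=2` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy numerical data with one gap per column (made-up values).
toy = pd.DataFrame({
    "engine_cc": [1200.0, 1500.0, np.nan, 1000.0, 2000.0],
    "max_power_bhp": [80.0, 100.0, 90.0, np.nan, 140.0],
})

# Deletion: keep only fully observed rows.
dropped = toy.dropna()

# Median imputation: each column's gaps get that column's median.
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# KNN imputation: average the value over the 2 nearest rows.
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

# Iterative imputation: model each incomplete column from the others, in turn.
iter_filled = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(toy), columns=toy.columns
)

print(len(toy), "rows before deletion,", len(dropped), "after")
print(median_filled)
```

In a real workflow the imputer would be fitted on the training split only and then applied to the test split, so that test data does not leak into the fill values.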

## Practical Exercise (TP)
- **Dataset**: Use a modified Titanic dataset with additional manipulated missing values and categorical features.
- **Tasks**:
  - Visualize different types of categorical data from the Titanic dataset.
  - Apply one-hot encoding and label encoding to convert categorical data into numerical data.
  - Perform different techniques of missing value imputation and compare their impacts on model performance.

## Expected Outcomes
By the end of this module, participants will:
- Be adept at identifying and manipulating categorical data.
- Understand various methods to handle missing data and apply these techniques in practice.
- Be able to improve data quality, which is crucial for building robust predictive models.

This module aims to provide hands-on experience with real-world data, bridging the gap between theoretical knowledge and practical application in feature engineering.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Data Collection

In [2]:
df = pd.read_csv("Car_details.csv", sep=",")

In [3]:
df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
1,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
2,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
3,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0
4,Hyundai Xcent 1.2 VTVT E Plus,2017,440000,45000,Petrol,Individual,Manual,First Owner,20.14 kmpl,1197 CC,81.86 bhp,113.75nm@ 4000rpm,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8122,Hyundai i20 Magna,2013,320000,110000,Petrol,Individual,Manual,First Owner,18.5 kmpl,1197 CC,82.85 bhp,113.7Nm@ 4000rpm,5.0
8123,Hyundai Verna CRDi SX,2007,135000,119000,Diesel,Individual,Manual,Fourth & Above Owner,16.8 kmpl,1493 CC,110 bhp,"24@ 1,900-2,750(kgm@ rpm)",5.0
8124,Maruti Swift Dzire ZDi,2009,382000,120000,Diesel,Individual,Manual,First Owner,19.3 kmpl,1248 CC,73.9 bhp,190Nm@ 2000rpm,5.0
8125,Tata Indigo CR4,2013,290000,25000,Diesel,Individual,Manual,First Owner,23.57 kmpl,1396 CC,70 bhp,140Nm@ 1800-3000rpm,5.0


In [5]:
df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,250Nm@ 1500-2500rpm,5.0
1,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,"12.7@ 2,700(kgm@ rpm)",5.0
2,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,22.4 kgm at 1750-2750rpm,5.0
3,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,"11.5@ 4,500(kgm@ rpm)",5.0
4,Hyundai Xcent 1.2 VTVT E Plus,2017,440000,45000,Petrol,Individual,Manual,First Owner,20.14 kmpl,1197 CC,81.86 bhp,113.75nm@ 4000rpm,5.0


In [7]:
df.isnull().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64
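
The near-identical counts for `mileage`, `engine`, and `seats` suggest, but do not prove, that the same rows are incomplete across several columns. A quick way to check is sketched below on a made-up frame that borrows three of the column names; on the real data the same two expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real table.
toy = pd.DataFrame({
    "mileage": ["21.14 kmpl", np.nan, "23.0 kmpl", "16.1 kmpl"],
    "engine": ["1498 CC", np.nan, "1396 CC", np.nan],
    "seats": [5.0, np.nan, 5.0, 5.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Rows where every one of these columns is missing at once.
all_missing = toy[["mileage", "engine", "seats"]].isnull().all(axis=1)

print(missing_pct)
print(all_missing.sum(), "row(s) missing all three columns")
```

If most gaps fall in the same rows, deleting those rows costs little extra information; if they are scattered, imputation is likely to preserve more data.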