
# Module 2: Data Transformation

## Student Details
**Name**: Sahith Krishna
**Registration Number**: 21BDS0078

## Overview
This module covers data transformation techniques applied to the PhDPublications dataset: deduplication, handling missing values, discretization, and outlier detection. Missing values are handled first by mean imputation and then by a regression-based estimate, which corresponds to Maximum Likelihood Estimation (MLE) under the assumption of normally distributed errors.

## Step 1: Load the Dataset
Load the dataset directly from the GitHub repository.

In [1]:

# Importing the necessary libraries
import pandas as pd

# Load the dataset from GitHub link
url = 'https://github.com/sahith-krishna19/EDA/blob/main/PhDPublications.csv?raw=true'
data = pd.read_csv(url)

# Display the first few rows of the dataset
data.head()


Unnamed: 0,rownames,articles,gender,married,kids,prestige,mentor
0,1,0,male,yes,0,2.52,7
1,2,0,female,no,0,2.05,6
2,3,0,female,no,0,3.75,6
3,4,0,male,yes,1,1.18,3
4,5,0,female,no,0,3.75,26


## Step 2: Data Deduplication
Check and remove duplicate entries to ensure data quality.

In [2]:

# Check for duplicate rows
duplicates = data.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# Remove duplicates if present
data = data.drop_duplicates()
print(f"Dataset shape after removing duplicates: {data.shape}")


Number of duplicate rows: 0
Dataset shape after removing duplicates: (915, 7)
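
A full-row check such as the one above can only flag rows that match in every column, including an identifier like 'rownames'. The sketch below uses a hypothetical toy frame (not the PhDPublications data) to show how the `subset` and `keep` arguments change what counts as a duplicate. Whether records that differ only in their identifier should be treated as duplicates is a judgment call that depends on the data.

```python
import pandas as pd

# Hypothetical toy frame for illustration; not the PhDPublications data
toy = pd.DataFrame({
    "rownames": [1, 2, 3, 4],
    "articles": [0, 0, 2, 2],
    "mentor": [7, 7, 3, 3],
})

# No row is an exact duplicate because 'rownames' differs on every row
full_dupes = int(toy.duplicated().sum())

# Ignoring the identifier column reveals repeated records
subset_dupes = int(toy.duplicated(subset=["articles", "mentor"]).sum())

# keep="first" retains the first occurrence of each repeated record
deduped = toy.drop_duplicates(subset=["articles", "mentor"], keep="first")

print(f"Full-row duplicates: {full_dupes}")
print(f"Duplicates ignoring 'rownames': {subset_dupes}")
print(f"Shape after subset deduplication: {deduped.shape}")
```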


## Step 3: Handling Missing Data
Count the missing values in each column and fill any gaps in 'prestige' with the column mean. This dataset has no missing values, so the fill leaves the data unchanged.

In [3]:

# Check for missing values in each column
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

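# Fill missing values in the 'prestige' column with the column mean.
# Assigning the result back avoids calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about.
data['prestige'] = data['prestige'].fillna(data['prestige'].mean())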

# Verify that there are no more missing values
print("Missing values after handling:")
print(data.isnull().sum())


Missing values in each column:
rownames    0
articles    0
gender      0
married     0
kids        0
prestige    0
mentor      0
dtype: int64
Missing values after handling:
rownames    0
articles    0
gender      0
married     0
kids        0
prestige    0
mentor      0
dtype: int64
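
Because the dataset has no gaps, the output above cannot show the imputation doing anything. The following minimal sketch applies the same mean fill to a hypothetical toy column with two missing entries, so the effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two gaps; not the PhDPublications data
toy = pd.DataFrame({"prestige": [2.0, np.nan, 4.0, np.nan, 3.0]})

n_missing_before = int(toy["prestige"].isnull().sum())

# mean() skips NaN by default, so this is (2 + 4 + 3) / 3
col_mean = toy["prestige"].mean()

# Assign the filled column back instead of using inplace=True
toy["prestige"] = toy["prestige"].fillna(col_mean)
n_missing_after = int(toy["prestige"].isnull().sum())

print(f"Missing before: {n_missing_before}, after: {n_missing_after}")
print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason a model-based estimate can be preferable.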


## Step 4: Handling Missing Data with Maximum Likelihood Estimation (MLE)
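Estimate missing values in the 'prestige' column from the numeric features 'articles', 'kids', and 'mentor' with a linear regression. Under the assumption of normally distributed errors, the least-squares fit is also the maximum likelihood estimate. Since the dataset has no missing values, a few are introduced artificially for demonstration.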

In [4]:

from sklearn.linear_model import LinearRegression
import numpy as np

# Work on a copy and introduce missing values for illustration.
# .loc slicing is label-based and inclusive, so rows 0 through 5 are blanked.
data_mle = data.copy()
data_mle.loc[0:5, 'prestige'] = np.nan

# Prepare the data for MLE - exclude rows where 'prestige' is missing for training
train_data = data_mle.dropna(subset=['prestige'])
predict_data = data_mle[data_mle['prestige'].isnull()]

# Define the features (excluding 'prestige') and target ('prestige')
X_train = train_data[['articles', 'kids', 'mentor']]
y_train = train_data['prestige']
X_predict = predict_data[['articles', 'kids', 'mentor']]

# Fit an ordinary least squares regression; with Gaussian errors this is the MLE of the coefficients
mle_model = LinearRegression()
mle_model.fit(X_train, y_train)

# Predict missing 'prestige' values using MLE
predicted_values = mle_model.predict(X_predict)
data_mle.loc[data_mle['prestige'].isnull(), 'prestige'] = predicted_values

# Display the updated dataset with estimated 'prestige' values
data_mle.head()


Unnamed: 0,rownames,articles,gender,married,kids,prestige,mentor
0,1,0,male,yes,0,3.089987,7
1,2,0,female,no,0,3.062587,6
2,3,0,female,no,0,3.062587,6
3,4,0,male,yes,1,2.93196,3
4,5,0,female,no,0,3.610588,26
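
The link between least squares and maximum likelihood can be checked numerically. The sketch below is an illustration on synthetic data (the variable names and the data-generating values are assumptions, not taken from the notebook): it fits the coefficients by least squares with NumPy, evaluates the Gaussian log-likelihood at that fit, and confirms that a perturbed coefficient vector scores lower.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical; not the PhDPublications columns)
n = 200
X = rng.normal(size=(n, 3))
true_beta = np.array([0.5, -0.2, 0.1])
y = 3.0 + X @ true_beta + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X])

# Ordinary least squares solution
beta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)


def gaussian_log_likelihood(beta, sigma2):
    """Log-likelihood of y under y ~ Normal(A @ beta, sigma2)."""
    resid = y - A @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2


# The MLE of the noise variance is the mean squared residual (divides by n)
resid_ols = y - A @ beta_ols
sigma2_mle = float(resid_ols @ resid_ols) / n

ll_at_ols = gaussian_log_likelihood(beta_ols, sigma2_mle)

# Shifting every coefficient away from the least-squares fit lowers the likelihood
ll_perturbed = gaussian_log_likelihood(beta_ols + 0.05, sigma2_mle)

print(f"Log-likelihood at OLS fit:   {ll_at_ols:.3f}")
print(f"Log-likelihood when shifted: {ll_perturbed:.3f}")
```

For a fixed noise variance, the log-likelihood depends on the coefficients only through the residual sum of squares, so minimizing one maximizes the other.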


## Step 5: Data Discretization
Discretize the 'articles' count into ordered categorical bins.

In [5]:

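# Define bins for discretizing the 'articles' column
bins = [0, 2, 5, 10, 20]
labels = ['0-2', '3-5', '6-10', '11-20']
# include_lowest=True closes the first bin on the left, so 0 articles maps to '0-2'
# instead of falling outside every bin and becoming NaN
data['articles_binned'] = pd.cut(data['articles'], bins=bins, labels=labels, include_lowest=True)

# Display the updated dataframe with binned categories
data[['articles', 'articles_binned']].head()


Unnamed: 0,articles,articles_binned
0,0,0-2
1,0,0-2
2,0,0-2
3,0,0-2
4,0,0-2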
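
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside every bin and becomes NaN unless `include_lowest=True` is passed. The sketch below compares both behaviours on a few hypothetical article counts.

```python
import pandas as pd

# Hypothetical article counts for illustration
articles = pd.Series([0, 1, 2, 3, 10, 19])
bins = [0, 2, 5, 10, 20]
labels = ["0-2", "3-5", "6-10", "11-20"]

# Default: intervals are (0, 2], (2, 5], (5, 10], (10, 20], so 0 is unbinned
default_cut = pd.cut(articles, bins=bins, labels=labels)

# include_lowest=True makes the first interval [0, 2]
inclusive_cut = pd.cut(articles, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "articles": articles,
    "default": default_cut,
    "include_lowest": inclusive_cut,
})
print(comparison)
```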


## Step 6: Outlier Detection
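Detect outliers in the 'prestige' column using the Interquartile Range (IQR) rule: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged. Removing the flagged rows is left optional.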

In [6]:

# Outlier detection using Interquartile Range (IQR) for 'prestige' column
Q1 = data['prestige'].quantile(0.25)
Q3 = data['prestige'].quantile(0.75)
IQR = Q3 - Q1

# Defining outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = data[(data['prestige'] < lower_bound) | (data['prestige'] > upper_bound)]
print(f"Number of outliers detected: {len(outliers)}")

# Optionally remove outliers (uncomment the line below to remove them)
# data = data[(data['prestige'] >= lower_bound) & (data['prestige'] <= upper_bound)]
print(f"Dataset shape after outlier handling: {data.shape}")


Number of outliers detected: 0
Dataset shape after outlier handling: (915, 8)
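
Since no outliers were flagged in 'prestige', the following sketch applies the same IQR rule to a hypothetical toy series that contains one extreme value. The helper name `iqr_bounds` is an illustrative assumption, not part of pandas. Clipping to the bounds is shown as one alternative to dropping rows.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy series with one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_bounds(toy)
mask = (toy < lower) | (toy > upper)
flagged = toy[mask]

# Alternative to removal: clip extreme values to the fences
clipped = toy.clip(lower=lower, upper=upper)

print(f"Bounds: [{lower}, {upper}]")
print(f"Flagged values: {flagged.tolist()}")
print(f"Maximum after clipping: {clipped.max()}")
```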


## Conclusion
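This notebook demonstrates four data transformation techniques: deduplication, handling missing values (mean imputation and a regression-based MLE estimate), discretization, and IQR-based outlier detection. The dataset contained no duplicate rows, no missing values, and no 'prestige' outliers, so the only lasting change is the added 'articles_binned' column. The transformed dataset will be used in subsequent modules for deeper analysis.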