<a href="https://colab.research.google.com/github/sahith-krishna19/EDA/blob/main/Module_2_Data_Transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2: Data Transformation

## Student Details
**Name**: Sahith Krishna
**Registration Number**: 21BDS0078

## Overview
This module covers data transformation techniques applied to the PhDPublications dataset: deduplication, handling missing values, discretization, and outlier detection.

## Step 1: Load the Dataset
Loading the dataset directly from the GitHub repository.

In [1]:

# Importing the necessary libraries
import pandas as pd

# Load the dataset from GitHub link
url = 'https://github.com/sahith-krishna19/EDA/blob/main/PhDPublications.csv?raw=true'
data = pd.read_csv(url)

# Display the first few rows of the dataset
data.head()


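| | rownames | articles | gender | married | kids | prestige | mentor |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | male | yes | 0 | 2.52 | 7 |
| 1 | 2 | 0 | female | no | 0 | 2.05 | 6 |
| 2 | 3 | 0 | female | no | 0 | 3.75 | 6 |
| 3 | 4 | 0 | male | yes | 1 | 1.18 | 3 |
| 4 | 5 | 0 | female | no | 0 | 3.75 | 26 |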
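As a quick sanity check after loading, it can help to confirm how pandas parsed each column. The sketch below is an offline illustration only: it re-types the five rows shown above as CSV text (the variable names `csv_text` and `sample` are assumptions, not part of the notebook) rather than fetching the file from GitHub.

```python
import io
import pandas as pd

# The five rows displayed above, re-typed as CSV text so the sketch runs offline.
csv_text = """rownames,articles,gender,married,kids,prestige,mentor
1,0,male,yes,0,2.52,7
2,0,female,no,0,2.05,6
3,0,female,no,0,3.75,6
4,0,male,yes,1,1.18,3
5,0,female,no,0,3.75,26
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Count columns should parse as integers, 'prestige' as float,
# and 'gender' / 'married' as text columns.
print(sample.dtypes)

# Summary statistics for the numeric columns only.
print(sample.describe())
```

The same two calls, `data.dtypes` and `data.describe()`, can be run on the full dataframe once it is loaded.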


## Step 2: Data Deduplication
Check and remove duplicate entries to ensure data quality.

In [2]:

# Check for duplicate rows
duplicates = data.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# Remove duplicates if present
data = data.drop_duplicates()
data.shape


Number of duplicate rows: 0


(915, 7)
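`duplicated()` and `drop_duplicates()` compare whole rows by default. A minimal sketch on hypothetical toy data (the frame `toy` and its values are invented for illustration) shows how the `subset` and `keep` parameters change what counts as a duplicate:

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly; row 3 repeats only the name.
toy = pd.DataFrame({
    "name": ["ana", "ben", "ana", "ben"],
    "articles": [3, 1, 3, 4],
})

# Full-row comparison: only the exact repeat (row 2) is flagged.
full_dupes = toy.duplicated()
print(full_dupes.sum())

# Compare on 'name' only and keep the last occurrence of each name.
deduped = toy.drop_duplicates(subset=["name"], keep="last").reset_index(drop=True)
print(deduped)
```

Whether a subset comparison is appropriate depends on which columns identify a record; for this dataset, full-row comparison is the conservative choice.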

## Step 3: Handling Missing Data
Identifying missing values and, as an example strategy, filling gaps in 'prestige' with the column mean.

In [3]:

# Check for missing values in each column
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

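# Handling missing values - example: fill missing 'prestige' values with the column mean.
# Assigning the result back avoids the chained-assignment warning that
# fillna(..., inplace=True) on a column selection triggers in recent pandas.
# This dataset has no missing values, so the line leaves the data unchanged.
data['prestige'] = data['prestige'].fillna(data['prestige'].mean())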

# Verify that there are no more missing values
data.isnull().sum()


Missing values in each column:
rownames    0
articles    0
gender      0
married     0
kids        0
prestige    0
mentor      0
dtype: int64


rownames    0
articles    0
gender      0
married     0
kids        0
prestige    0
mentor      0
dtype: int64
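Mean imputation is only one option. The sketch below, on hypothetical toy data with gaps (the real dataset has none), illustrates three common alternatives: median imputation for a skewed numeric column, mode imputation for a categorical column, and dropping incomplete rows. Column names ending in `_median` and `_filled` are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; values are invented for illustration.
toy = pd.DataFrame({
    "prestige": [2.0, np.nan, 4.0, 100.0],
    "gender": ["male", None, "female", "female"],
})

# The median is less sensitive than the mean to the extreme value 100.0.
toy["prestige_median"] = toy["prestige"].fillna(toy["prestige"].median())

# Categorical column: fill with the most frequent value (the mode).
most_common = toy["gender"].mode().iloc[0]
toy["gender_filled"] = toy["gender"].fillna(most_common)

# Alternative: drop every row that is missing either original value.
complete_rows = toy.dropna(subset=["prestige", "gender"])

print(toy)
print(len(complete_rows))
```

Which strategy fits depends on how much data is missing and why; dropping rows is simplest but discards information from the other columns.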


## Step 4: Data Discretization
Discretizing the count variable 'articles' into categorical bins.

In [4]:

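# Define bins for discretizing the 'articles' column.
# include_lowest=True closes the first interval on the left so that a count
# of 0 falls in '0-2' instead of becoming NaN.
bins = [0, 2, 5, 10, 20]
labels = ['0-2', '3-5', '6-10', '11-20']
data['articles_binned'] = pd.cut(data['articles'], bins=bins, labels=labels, include_lowest=True)

# Display the updated dataframe
data[['articles', 'articles_binned']].head()


| | articles | articles_binned |
|---|---|---|
| 0 | 0 | 0-2 |
| 1 | 0 | 0-2 |
| 2 | 0 | 0-2 |
| 3 | 0 | 0-2 |
| 4 | 0 | 0-2 |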
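By default `pd.cut` builds right-closed intervals, so with edges `[0, 2, 5, 10, 20]` the first bin is `(0, 2]` and a value of exactly 0 matches no bin. A minimal sketch on hypothetical counts (the series `counts` is invented for illustration) compares the default with `include_lowest=True`:

```python
import pandas as pd

counts = pd.Series([0, 1, 2, 3, 10, 20])  # hypothetical article counts
bins = [0, 2, 5, 10, 20]
labels = ["0-2", "3-5", "6-10", "11-20"]

# Default: intervals are (0, 2], (2, 5], ... so the value 0 is left unbinned (NaN).
default_cut = pd.cut(counts, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in "0-2".
closed_cut = pd.cut(counts, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "count": counts,
    "default": default_cut,
    "include_lowest": closed_cut,
})
print(comparison)

# A quick check that nothing was left unbinned.
print(closed_cut.isna().sum())
```

Counting NaN values in the binned column, as in the last line, is a cheap way to catch values that fall outside the bin edges.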


## Step 5: Handling Outliers
Detecting and handling outliers in numerical data.

In [5]:

# Outlier detection using Interquartile Range (IQR) for 'prestige' column
Q1 = data['prestige'].quantile(0.25)
Q3 = data['prestige'].quantile(0.75)
IQR = Q3 - Q1

# Defining outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = data[(data['prestige'] < lower_bound) | (data['prestige'] > upper_bound)]
print(f"Number of outliers detected: {len(outliers)}")

# Optionally remove outliers (uncomment the line below to remove them)
# data = data[(data['prestige'] >= lower_bound) & (data['prestige'] <= upper_bound)]
data.shape


Number of outliers detected: 0


(915, 8)
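Removing rows is not the only way to treat outliers. One alternative, sketched here on hypothetical scores (the series `values` is invented for illustration), is to cap values at the IQR bounds with `Series.clip`, which keeps every row:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # hypothetical scores

# Same IQR rule as above: bounds at 1.5 * IQR beyond the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside the bounds.
is_outlier = (values < lower_bound) | (values > upper_bound)
print(f"Number of outliers detected: {is_outlier.sum()}")

# Cap values at the bounds instead of dropping the rows.
capped = values.clip(lower=lower_bound, upper=upper_bound)
print(capped)
```

Capping preserves the sample size but distorts the tail of the distribution, so it is worth recording which values were changed.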

## Conclusion
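This notebook demonstrates data transformation techniques including deduplication, handling missing values, discretization, and outlier detection. On this dataset the checks found no duplicate rows, no missing values, and no IQR outliers in 'prestige', so the only change to the data is the added 'articles_binned' column. The transformed dataset will be used in subsequent modules for deeper analysis.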