
**1.1 Data Cleaning:**
You are given a dataset employee_data.csv containing information about employees, including their ID, name, age, department, and salary. The dataset has missing values and some inconsistencies in the department names (e.g., "HR", "Human Resources", "H.R." should all be treated as "HR"). Perform the following data cleaning tasks:


1.   Handle missing values in the dataset.
2.   Standardize the department names to ensure consistency.
3.   Remove any duplicate records.

**Tasks:**


*   Load the dataset and inspect the first few rows.
*   Identify and handle missing values in the dataset.
*   Standardize department names by replacing variations with a single standard value.
*   Remove duplicate records based on the ID column.


In [None]:
# SAMPLE DATA:: employee_data.csv
ID,Name,Age,Department,Salary
1,John,28,HR,50000
2,Jane,35,Finance,60000
3,Emily,,HR,55000
4,Michael,40,Human Resources,
5,Sarah,29,IT,52000
6,David,50,Finance,75000
7,Laura,38,H.R.,68000
8,Robert,32,HR,57000
9,Linda,45,IT,62000
10,James,30,HR,51000
11,James,30,HR,51000

In [None]:
import pandas as pd
# Step 1: Load the dataset
df = pd.read_csv('lab1-datasets/employee_data.csv')
print("Initial Data:\n", df.head())
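# Step 2: Handle missing values
# Fill missing 'Age' with the mean age and missing 'Salary' with the mean salary.
# Assign the result back: calling fillna(inplace=True) on a selected column
# raises a FutureWarning in recent pandas and may not modify df.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())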
# Step 3: Standardize department names
df['Department'] = df['Department'].replace({
    'Human Resources': 'HR',
    'H.R.': 'HR',
    'hr': 'HR'
})
# Step 4: Remove duplicate records based on 'ID'
df.drop_duplicates(subset='ID', keep='first', inplace=True)
print("\nCleaned Data:\n", df.head())


Initial Data:
    ID     Name   Age       Department   Salary
0   1     John  28.0               HR  50000.0
1   2     Jane  35.0          Finance  60000.0
2   3    Emily   NaN               HR  55000.0
3   4  Michael  40.0  Human Resources      NaN
4   5    Sarah  29.0               IT  52000.0

Cleaned Data:
    ID     Name   Age Department   Salary
0   1     John  28.0         HR  50000.0
1   2     Jane  35.0    Finance  60000.0
2   3    Emily  35.7         HR  55000.0
3   4  Michael  40.0         HR  58100.0
4   5    Sarah  29.0         IT  52000.0
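
Two details of the cleaning step are worth checking by hand. A fixed replacement dictionary only catches the spellings listed in it, and removing duplicates by ID keeps rows 10 and 11 of the sample data, which differ only in ID. The following is a minimal sketch, not the lab's required solution: it rebuilds the sample data inline, normalizes department labels through a lowercase key before mapping, and lists rows that match on every column except ID. The names `aliases`, `key`, and `content_cols` are illustrative, and treating rows 10 and 11 as the same employee is an assumption about the data.

```python
import io
import pandas as pd

# Sample data copied from the employee_data.csv listing in this lab
csv_text = """ID,Name,Age,Department,Salary
1,John,28,HR,50000
2,Jane,35,Finance,60000
3,Emily,,HR,55000
4,Michael,40,Human Resources,
5,Sarah,29,IT,52000
6,David,50,Finance,75000
7,Laura,38,H.R.,68000
8,Robert,32,HR,57000
9,Linda,45,IT,62000
10,James,30,HR,51000
11,James,30,HR,51000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Build a comparison key: trim spaces, drop periods, lowercase.
# "HR", "H.R." and "hr" all become "hr".
key = df['Department'].str.strip().str.replace('.', '', regex=False).str.lower()

# Hypothetical alias table; unknown departments keep their original spelling
aliases = {'hr': 'HR', 'human resources': 'HR'}
df['Department'] = key.map(aliases).fillna(df['Department'].str.strip())

# Rows that are identical on everything except ID
content_cols = [c for c in df.columns if c != 'ID']
dupes = df[df.duplicated(subset=content_cols, keep=False)]
print(dupes)

# Keep the first of each such group (only if these really are duplicates)
deduped = df.drop_duplicates(subset=content_cols, keep='first')
print(len(df), "->", len(deduped))
```

Whether to deduplicate on ID alone or on the remaining columns depends on how IDs were assigned, so it is worth inspecting `dupes` before dropping anything.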


**1.2 Normalization:**
You are given a dataset student_scores.csv that contains the scores of students in different subjects. The scores are on different scales (e.g., some are out of 100, others out of 50). Normalize the scores to a common scale for comparison.


1.   Normalize the scores of all subjects to a 0-1 scale using Min-Max normalization.
2.   Compare the original and normalized scores.

**Tasks:**

*   Load the dataset and inspect the first few rows.
*   Apply Min-Max normalization to the scores of all subjects.
*   Display the original and normalized scores side by side.


In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Step 1: Load the dataset
df = pd.read_csv('lab1-datasets/student_scores.csv')
print("Initial Data:\n", df.head())

# Step 2: Apply Min-Max normalization
scaler = MinMaxScaler()
df[['Math', 'Science', 'English']] = scaler.fit_transform(df[['Math', 'Science', 'English']])

print("\nNormalized Scores:\n", df.head())


Initial Data:
    StudentID  Math  Science  English
0          1    78       65       80
1          2    88       75       85
2          3    60       50       55
3          4    90       78       92
4          5    55       48       58

Normalized Scores:
    StudentID      Math  Science   English
0          1  0.657143  0.53125  0.675676
1          2  0.942857  0.84375  0.810811
2          3  0.142857  0.06250  0.000000
3          4  1.000000  0.93750  1.000000
4          5  0.000000  0.00000  0.081081
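
Min-Max normalization maps each value x in a column to (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. The sketch below applies that formula directly in pandas and keeps the original columns, which is one way to show original and normalized scores side by side. It uses only the five rows printed above, so the column minimum and maximum (and therefore some normalized values) can differ from those computed on the full student_scores.csv. The `_norm` suffix is an illustrative naming choice.

```python
import pandas as pd

# First five rows of student_scores.csv as printed above
scores = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5],
    'Math': [78, 88, 60, 90, 55],
    'Science': [65, 75, 50, 78, 48],
    'English': [80, 85, 55, 92, 58],
})
subjects = ['Math', 'Science', 'English']

# Per-column minimum and maximum
col_min = scores[subjects].min()
col_max = scores[subjects].max()

# Min-Max formula applied column by column
normalized = (scores[subjects] - col_min) / (col_max - col_min)

# Keep the originals and append the normalized columns next to them
side_by_side = scores.join(normalized.add_suffix('_norm'))
print(side_by_side.round(3))
```

For example, the Math score 78 becomes (78 - 55) / (90 - 55) = 23/35, about 0.657.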


**1.3 Data Binning**
You are given a dataset customer_ages.csv that contains the ages of customers. Perform data binning on the Age column to group customers into age ranges: "Young" (18-30), "Middle-aged" (31-50), and "Senior" (51 and above).


1.   Perform data binning on the Age column.
2.   Assign a category label to each age group.
3.   Analyze the distribution of customers across the age groups.

**Tasks:**


*   Load the dataset and inspect the first few rows.
*   Create bins for the Age column and assign category labels.
*   Calculate the number of customers in each age group.


In [None]:
import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('lab1-datasets/customer_ages.csv')
print("Initial Data:\n", df.head())

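# Step 2: Create bins and assign labels
# With right=False each bin is [lower, upper), so these edges give
# 18-30, 31-50, and 51+ for whole-number ages (no upper age limit)
bins = [18, 31, 51, float('inf')]
labels = ['Young', 'Middle-aged', 'Senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)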

print("\nData after Binning:\n", df.head())

# Step 3: Calculate distribution of customers in each age group
age_group_distribution = df['AgeGroup'].value_counts()
print("\nAge Group Distribution:\n", age_group_distribution)


Initial Data:
    CustomerID  Age
0           1   25
1           2   42
2           3   36
3           4   53
4           5   28

Data after Binning:
    CustomerID  Age     AgeGroup
0           1   25        Young
1           2   42  Middle-aged
2           3   36  Middle-aged
3           4   53       Senior
4           5   28        Young

Age Group Distribution:
 AgeGroup
Middle-aged    7
Young          5
Senior         3
Name: count, dtype: int64
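
Which end of each bin is closed decides where the boundary ages 30 and 50 land. The sketch below compares two sets of edges under `right=False`, where every bin is [lower, upper). The ages are made-up boundary values chosen for illustration, not rows from customer_ages.csv.

```python
import pandas as pd

# Hypothetical ages sitting on or next to the group boundaries
ages = pd.Series([18, 30, 31, 50, 51, 100])
labels = ['Young', 'Middle-aged', 'Senior']

# Edges 18, 30, 50, 100: bins are [18, 30), [30, 50), [50, 100)
half_open = pd.cut(ages, bins=[18, 30, 50, 100], labels=labels, right=False)

# Edges 18, 31, 51, inf: bins are [18, 31), [31, 51), [51, inf)
shifted = pd.cut(ages, bins=[18, 31, 51, float('inf')], labels=labels, right=False)

comparison = pd.DataFrame({
    'Age': ages,
    'edges_18_30_50_100': half_open,
    'edges_18_31_51_inf': shifted,
})
print(comparison)
```

With the first set of edges, age 30 is labelled Middle-aged, age 50 is labelled Senior, and age 100 falls outside every bin (NaN). The second set matches the ranges 18-30, 31-50, and 51 and above for whole-number ages.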


**1.4 Discretization**
You are given a dataset sales_data.csv that contains the monthly sales figures of a company. The sales figures are continuous values. Discretize the sales data into categories such as "Low", "Medium", and "High" based on sales volume.


1.   Discretize the Sales column into three categories.
2.   Assign a category label based on the discretized sales values.
3.   Analyze the distribution of sales across the categories.

**Tasks:**

*   Load the dataset and inspect the first few rows.
*   Apply discretization to the Sales column.
*   Assign appropriate category labels and analyze the distribution.


In [None]:
import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('lab1-datasets/sales_data.csv')
print("Initial Data:\n", df.head())

# Step 2: Apply discretization
bins = [0, 5000, 20000, float('inf')]
labels = ['Low', 'Medium', 'High']
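# include_lowest=True keeps a month with exactly 0 sales in 'Low' instead of NaN
df['SalesCategory'] = pd.cut(df['Sales'], bins=bins, labels=labels, include_lowest=True)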

print("\nData after Discretization:\n", df.head())

# Step 3: Analyze the distribution of sales categories
sales_category_distribution = df['SalesCategory'].value_counts()
print("\nSales Category Distribution:\n", sales_category_distribution)


Initial Data:
       Month  Sales
0   January  15000
1  February  18000
2     March  12000
3     April  30000
4       May  22000

Data after Discretization:
       Month  Sales SalesCategory
0   January  15000        Medium
1  February  18000        Medium
2     March  12000        Medium
3     April  30000          High
4       May  22000          High

Sales Category Distribution:
 SalesCategory
Medium    7
High      4
Low       1
Name: count, dtype: int64
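
Fixed thresholds such as 5000 and 20000 are one way to discretize. When no business thresholds are given, a common alternative is quantile-based discretization, which places roughly the same number of rows in each category. The sketch below uses `pd.qcut` on the five rows printed above. It is a different technique from the fixed-edge `pd.cut` used in this lab, and the column name `SalesTercile` is illustrative.

```python
import pandas as pd

# First five rows of sales_data.csv as printed above
sales = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April', 'May'],
    'Sales': [15000, 18000, 12000, 30000, 22000],
})

# Split at the 1/3 and 2/3 quantiles of the observed sales values
sales['SalesTercile'] = pd.qcut(sales['Sales'], q=3, labels=['Low', 'Medium', 'High'])
print(sales)

# Report counts in Low/Medium/High order rather than by frequency
counts = sales['SalesTercile'].value_counts().reindex(['Low', 'Medium', 'High'])
print(counts)
```

Because the cut points come from the data, the same sales figure can receive a different label when rows are added, so quantile bins suit relative comparisons better than absolute targets.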


**1.5 Feature Selection**
You are given a dataset medical_data.csv that contains several features related to patients' medical history and a target variable indicating whether they have a specific disease. Perform feature selection to identify the most important features for predicting the disease.

1.   Use a feature selection method (e.g., Chi-square test, ANOVA, or correlation) to rank the features.
2.   Identify the top 3 features related to the target variable.
3.   Discuss how the selected features could influence the prediction.

**Tasks:**


*   Load the dataset and inspect the first few rows.
*   Apply a feature selection method to rank the features.
*   Identify and display the top 3 features.


In [None]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Step 1: Load the dataset
df = pd.read_csv('lab1-datasets/medical_data.csv')
print("Initial Data:\n", df.head())

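# Step 2: Define features and target variable
# PatientID is an identifier, not a medical measurement, so exclude it from the features
X = df.drop(columns=['PatientID', 'Disease'])
y = df['Disease']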

# Step 3: Apply Chi-square feature selection
selector = SelectKBest(score_func=chi2, k=3)
selector.fit(X, y)

# Step 4: Get the top 3 features
top_features = X.columns[selector.get_support()]
print("\nTop 3 Features for Predicting Disease:\n", top_features)


Initial Data:
    PatientID  Age  BloodPressure  Cholesterol  Glucose  HeartRate  Disease
0          1   45            130          180       95         70        1
1          2   50            140          200      105         75        1
2          3   60            150          240      120         80        1
3          4   40            120          170       90         65        0
4          5   35            110          160       85         60        0

Top 3 Features for Predicting Disease:
 Index(['Age', 'Cholesterol', 'Glucose'], dtype='object')
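
To see what the chi-square score measures, the sketch below recomputes it by hand for the five rows printed above. For each feature it compares the per-class sum of the feature with the sum expected if the feature were independent of the class. This follows my understanding of how scikit-learn's `chi2` scores non-negative features; treat it as an illustration rather than a reference implementation. With only five rows the ranking need not match the result on the full medical_data.csv.

```python
import numpy as np
import pandas as pd

# First five rows of medical_data.csv as printed above (PatientID omitted)
df = pd.DataFrame({
    'Age': [45, 50, 60, 40, 35],
    'BloodPressure': [130, 140, 150, 120, 110],
    'Cholesterol': [180, 200, 240, 170, 160],
    'Glucose': [95, 105, 120, 90, 85],
    'HeartRate': [70, 75, 80, 65, 60],
    'Disease': [1, 1, 1, 0, 0],
})
X = df.drop(columns=['Disease'])
y = df['Disease']

# Observed: sum of each feature within each class (rows: classes 0 and 1)
observed = X.groupby(y).sum()

# Expected: class proportion times the feature's overall sum
class_prob = y.value_counts(normalize=True).sort_index()
feature_total = X.sum()
expected = pd.DataFrame(
    np.outer(class_prob.values, feature_total.values),
    index=class_prob.index,
    columns=X.columns,
)

# Chi-square statistic per feature, highest first
chi2_scores = ((observed - expected) ** 2 / expected).sum().sort_values(ascending=False)
print(chi2_scores)
```

Because the statistic is built from raw sums, features measured on larger scales tend to receive larger scores. The chi-square test was designed for counts and frequencies, so for continuous measurements like these an ANOVA F-test is often a better fit.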
