<h1>PROJECT</h1>

**Project Title**: Data Cleaning, Preprocessing, and Visualization of Student Performance Dataset

**Project Overview**:
> In this project, you will be working with a student performance dataset that contains information on students' demographic details and their scores in various subjects. The dataset also includes some inconsistencies, missing values, outliers, and data type errors that need to be addressed. Your task is to clean, preprocess, and visualize the data to gain meaningful insights into student performance.

```python
import pandas as pd
import numpy as np

# Creating the enriched dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ivy', 'Jack',
             'Alice', 'Eva', 'Charlie', 'Eva'],  # Duplicate names
    'Age': [25, 30, 35, np.nan, 40, 50, -999, 28, None, 45, 25, 30, 35, 40],  # Missing values, None, Outlier (-999)
    'Gender': ['Female', 'Male', 'male', 'Female', 'Female', np.nan, 'Female', 'f', 'Female', 'Male',
               'Female', 'Female', 'Male', 'Female'],  # Inconsistent capitalization
    'Math_Score': [88, 92, 100, 78, 85, np.nan, 110, 95, 60, 88, 88, np.nan, 'N/A', 85],  # Missing values, Outliers, Incorrect data type
    'Science_Score': [np.nan, 85, 95, 70, 'missing', 98, 100, 105, 60, 77, 85, 85, 70, 70],  # Missing values, Inconsistent data representation
    'Enrollment_Date': ['2020-01-15', '15/02/2020', '2020/03/10', 'April 5, 2020', np.nan, '2020-05-25',
                        '2020-06-30', 'July 10, 2020', '2020-08-15', '2020-09-01', '2020-01-15', '2020-05-25',
                        '2020/03/10', 'April 5, 2020'],  # Inconsistent date formats, Missing values
    'Graduated': ['Yes', 'No', 'No', 'Yes', 'No', np.nan, 'yes', 'No', 'Yes', 'no', 'Yes', 'No', 'No', 'Yes'],  # Inconsistent capitalization
}

# Converting into a DataFrame
df = pd.DataFrame(data)
df
```


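| | Name | Age | Gender | Math_Score | Science_Score | Enrollment_Date | Graduated |
|---|---|---|---|---|---|---|---|
| 0 | Alice | 25.0 | Female | 88.0 | NaN | 2020-01-15 | Yes |
| 1 | Bob | 30.0 | Male | 92.0 | 85 | 15/02/2020 | No |
| 2 | Charlie | 35.0 | male | 100.0 | 95 | 2020/03/10 | No |
| 3 | David | NaN | Female | 78.0 | 70 | April 5, 2020 | Yes |
| 4 | Eva | 40.0 | Female | 85.0 | missing | NaN | No |
| 5 | Frank | 50.0 | NaN | NaN | 98 | 2020-05-25 | NaN |
| 6 | Grace | -999.0 | Female | 110.0 | 100 | 2020-06-30 | yes |
| 7 | Helen | 28.0 | f | 95.0 | 105 | July 10, 2020 | No |
| 8 | Ivy | NaN | Female | 60.0 | 60 | 2020-08-15 | Yes |
| 9 | Jack | 45.0 | Male | 88.0 | 77 | 2020-09-01 | no |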


**Issues Present in the Dataset:**<br>
**Duplicate Entries:**
> * Duplicate names: Alice, Charlie, and Eva each appear more than once, although the repeated rows differ in other columns.

**Missing Values:**
> * NaN, None, and placeholder strings ('missing', 'N/A') in numeric and categorical columns.

**Inconsistent Data Formats:**
> * Dates are in different formats (e.g., 2020-01-15, 15/02/2020, April 5, 2020).<br>
> * Gender mixes capitalization and abbreviations ('Male', 'male', 'f'), and Graduated mixes capitalization ('Yes', 'yes', 'no').

**Outliers:**
> * Age column contains an outlier value of -999.<br>
> * Math_Score has a score of 110 and Science_Score has a score of 105, which might be unrealistic depending on the context.

**Incorrect Data Types:**
> * Science_Score has a string entry ('missing').<br>
> * Math_Score contains a non-numeric entry ('N/A').

**Lab Manual Steps for Cleaning and Preprocessing:**

1. **Handling Missing Values**: isna(), fillna(), dropna(), interpolate()
2. **Handling Inconsistent Data Formats**: str.lower(), str.replace(), to_datetime(), apply()
3. **Removing or Handling Duplicate Entries**: duplicated(), drop_duplicates()
4. **Handling Outliers**: clip(), IQR method, z-score, Winsorization
5. **Correcting Data Types**: astype(), to_numeric(), apply()

**Practical Walkthrough:**

**1. Removing Duplicates**

```python
# Keep the first row for each name and drop any later rows with the same name
df.drop_duplicates(subset='Name', inplace=True)
```

**2. Handling Missing Values**

```python
# Drop every row that still contains at least one missing value
df = df.dropna()
df.head()
```

| | Name | Age | Gender | Math_Score | Science_Score | Enrollment_Date | Graduated |
|---|---|---|---|---|---|---|---|
| 1 | Bob | 30.0 | Male | 92 | 85 | 15/02/2020 | No |
| 2 | Charlie | 35.0 | male | 100 | 95 | 2020/03/10 | No |
| 6 | Grace | -999.0 | Female | 110 | 100 | 2020-06-30 | yes |
| 7 | Helen | 28.0 | f | 95 | 105 | July 10, 2020 | No |
| 9 | Jack | 45.0 | Male | 88 | 77 | 2020-09-01 | no |
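
Dropping every incomplete row is the bluntest option: only 5 of the 10 uniquely named students survive it. As a possible alternative, the sketch below imputes instead of dropping, using the Age and Gender values of those ten students. Filling Age with the median and Gender with an 'Unknown' label are illustrative assumptions, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Age and Gender of the ten uniquely named students, copied from the dataset
sample = pd.DataFrame({
    'Age': [25, 30, 35, np.nan, 40, 50, -999, 28, None, 45],
    'Gender': ['Female', 'Male', 'male', 'Female', 'Female', np.nan,
               'Female', 'f', 'Female', 'Male'],
})

# Count missing values per column before deciding how to handle them
missing_per_column = sample.isna().sum()

# Assumption: impute rather than drop. The median is less sensitive than the
# mean to the -999 sentinel that is still present at this stage.
filled = sample.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
filled['Gender'] = filled['Gender'].fillna('Unknown')

print(missing_per_column)
print(filled)
```

Which strategy fits depends on how much data can be sacrificed; with a dataset this small, imputation keeps all rows available for the later plots.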


**3. Handling Inconsistent Data Formats**
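
One way to approach this step is sketched below on the five rows left after `dropna()`. The `gender_map` dictionary, the `DATE_FORMATS` list, and the `parse_date` helper are hypothetical names introduced for illustration. The sketch also assumes that slash-separated dates starting with the day (15/02/2020) are day-first and that the four listed layouts cover every date in the column.

```python
from datetime import datetime

import pandas as pd

# Text columns of the five rows left after the dropna() step
df = pd.DataFrame({
    'Gender': ['Male', 'male', 'Female', 'f', 'Male'],
    'Enrollment_Date': ['15/02/2020', '2020/03/10', '2020-06-30',
                        'July 10, 2020', '2020-09-01'],
    'Graduated': ['No', 'No', 'yes', 'No', 'no'],
}, index=[1, 2, 6, 7, 9])

# Map every lower-cased spelling onto one canonical label
gender_map = {'male': 'Male', 'm': 'Male', 'female': 'Female', 'f': 'Female'}
df['Gender'] = df['Gender'].str.strip().str.lower().map(gender_map)

# 'yes' / 'no' / 'Yes' / 'No' all become 'Yes' or 'No'
df['Graduated'] = df['Graduated'].str.strip().str.capitalize()

# Assumption: these layouts cover the column, and D/M/Y dates are day-first
DATE_FORMATS = ['%Y-%m-%d', '%Y/%m/%d', '%d/%m/%Y', '%B %d, %Y']

def parse_date(value):
    """Hypothetical helper: try each known layout, return NaT if none fits."""
    if pd.isna(value):
        return pd.NaT
    for fmt in DATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(value, fmt))
        except ValueError:
            continue
    return pd.NaT

df['Enrollment_Date'] = pd.to_datetime(df['Enrollment_Date'].apply(parse_date))
print(df)
```

Trying explicit formats one by one is slower than a single vectorized `to_datetime()` call, but it makes the day-first decision visible instead of leaving it to the parser's guess.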

**4. Handling Outliers**
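
A minimal sketch of this step, again on the five rows left after `dropna()`. The `iqr_outlier_mask` helper is a hypothetical name. Treating -999 as a missing-value sentinel and capping scores at 100 are assumptions: the source only says that scores above 100 might be unrealistic depending on the context.

```python
import pandas as pd

# Numeric columns of the five rows left after the dropna() step
df = pd.DataFrame({
    'Age': [30.0, 35.0, -999.0, 28.0, 45.0],
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
}, index=[1, 2, 6, 7, 9])

def iqr_outlier_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# The IQR rule flags -999; replace it with NaN instead of keeping a fake age
age_outliers = iqr_outlier_mask(df['Age'])
df['Age'] = df['Age'].mask(age_outliers)

# Assumption: scores are graded on a 0-100 scale, so cap anything outside it
for col in ['Math_Score', 'Science_Score']:
    df[col] = pd.to_numeric(df[col], errors='coerce').clip(lower=0, upper=100)

print(df)
```

Clipping keeps the row but changes the value; if a score above 100 is more likely a data-entry error than extra credit, setting it to NaN may be the more honest choice.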

**5. Correcting Data Types**
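
The sketch below shows one possible way to correct the types. The first part uses the first five raw Science_Score entries to show how `to_numeric()` handles the 'missing' placeholder. The second part converts the score columns of the five remaining rows, which are assumed to still be stored as `object` because the raw columns contained strings. Converting Age to the nullable `Int64` type is an illustrative choice.

```python
import numpy as np
import pandas as pd

# First five raw Science_Score entries, including the 'missing' placeholder
raw_science = pd.Series([np.nan, 85, 95, 70, 'missing'], name='Science_Score')
# errors='coerce' turns anything that is not a number into NaN instead of raising
science_numeric = pd.to_numeric(raw_science, errors='coerce')

# The five rows left after dropna(); score columns are assumed to be object dtype
df = pd.DataFrame({
    'Age': [30.0, 35.0, -999.0, 28.0, 45.0],
    'Math_Score': pd.Series([92, 100, 110, 95, 88], dtype=object),
    'Science_Score': pd.Series([85, 95, 100, 105, 77], dtype=object),
})

for col in ['Math_Score', 'Science_Score']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Nullable integer type: whole-number ages that can still hold missing values
df['Age'] = df['Age'].astype('Int64')

print(science_numeric)
print(df.dtypes)
```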

**6. Visualizing the Distribution of Age**<br>This will help to see how the ages are distributed in the dataset.
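
A possible histogram, sketched with matplotlib on the ages of the five rows left after `dropna()`. The 0 to 120 plausibility range used to exclude the -999 value and the choice of five bins are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Ages of the five remaining rows; -999 is the outlier noted earlier
ages = pd.Series([30, 35, -999, 28, 45], name='Age')
# Assumption: only ages between 0 and 120 are plausible
valid_ages = ages[ages.between(0, 120)]

fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(valid_ages, bins=5, edgecolor='black')
ax.set_xlabel('Age')
ax.set_ylabel('Number of students')
ax.set_title('Distribution of Age')
fig.tight_layout()
```

Leaving -999 in would stretch the x-axis so far that the real ages collapse into a single bar.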

**7. Visualizing Gender Distribution**<br>A pie chart to show the proportion of males and females.
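
A minimal pie-chart sketch on the Gender values of the five remaining rows. The spelling map is an assumption about how 'male' and 'f' should be read; without it, each spelling would get its own slice.

```python
import pandas as pd
import matplotlib.pyplot as plt

gender = pd.Series(['Male', 'male', 'Female', 'f', 'Male'], name='Gender')
# Normalise spellings so each category is counted once
gender = gender.str.lower().map(
    {'male': 'Male', 'm': 'Male', 'female': 'Female', 'f': 'Female'})
gender_counts = gender.value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(gender_counts, labels=gender_counts.index, autopct='%1.0f%%', startangle=90)
ax.set_title('Gender Distribution')
fig.tight_layout()
```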

**8. Scatter Plot for Math vs Science Scores**<br>This shows the relationship between Math and Science scores.
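
A short sketch with matplotlib, using the scores of the five remaining rows as they appear before any outlier handling. The Pearson correlation is added as an optional numeric companion to the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.DataFrame({
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(scores['Math_Score'], scores['Science_Score'])
ax.set_xlabel('Math_Score')
ax.set_ylabel('Science_Score')
ax.set_title('Math vs Science Scores')
fig.tight_layout()

# Pearson correlation puts a number on the trend visible in the plot
correlation = scores['Math_Score'].corr(scores['Science_Score'])
print(f'Pearson correlation: {correlation:.2f}')
```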

**9. Histogram for Enrollment Dates**<br>This will help in visualizing the frequency of enrollments over time.
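
One way to sketch this is to count enrollments per calendar month and draw the counts as bars. The dates below are those of the five remaining rows, written in ISO form on the assumption that the mixed formats have already been parsed (with 15/02/2020 read as day-first).

```python
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.to_datetime(pd.Series(
    ['2020-02-15', '2020-03-10', '2020-06-30', '2020-07-10', '2020-09-01'],
    name='Enrollment_Date'))

# Bucket by calendar month and count enrollments in each bucket
per_month = dates.dt.to_period('M').value_counts().sort_index()
per_month.index = per_month.index.astype(str)  # e.g. '2020-02'

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(per_month.index, per_month.values)
ax.set_xlabel('Enrollment month')
ax.set_ylabel('Number of students')
ax.set_title('Enrollments over Time')
fig.tight_layout()
```

Months without any enrollment do not appear on the axis in this version; reindexing over a full `pd.period_range` would show them as empty bars.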

**10. Pair Plot to Visualize Relationships**<br>
Create a pair plot to see relationships between Age, Math_Score, and Science_Score.
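
As a sketch, `pandas.plotting.scatter_matrix` produces a pair plot with only pandas and matplotlib; `seaborn.pairplot` is a common alternative when seaborn is available. The four rows used here are the remaining students with a plausible Age, which assumes the row with Age -999 has been excluded.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Remaining rows with a plausible Age (the -999 row is left out)
numeric = pd.DataFrame({
    'Age': [30, 35, 28, 45],
    'Math_Score': [92, 100, 95, 88],
    'Science_Score': [85, 95, 105, 77],
})

# Off-diagonal cells are scatter plots, diagonal cells are histograms
axes = scatter_matrix(numeric, figsize=(7, 7), diagonal='hist')
fig = axes[0, 0].get_figure()
fig.suptitle('Pairwise Relationships')
```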

**11. Violin Plot for Score Distributions**<br>
Create a violin plot to show the distribution of Math_Score and Science_Score by Gender.
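
A minimal sketch using matplotlib's `Axes.violinplot`, with one panel per score column and one violin per gender. It assumes Gender has already been normalised to 'Male' and 'Female'. With only five rows the shapes are purely illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Five remaining rows, with Gender assumed to be normalised already
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
})

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
for ax, col in zip(axes, ['Math_Score', 'Science_Score']):
    labels, samples = [], []
    # One sample of scores per gender, in sorted group order
    for gender, values in df.groupby('Gender')[col]:
        labels.append(gender)
        samples.append(values.to_numpy(dtype=float))
    ax.violinplot(samples, showmedians=True)
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_title(col)
axes[0].set_ylabel('Score')
fig.tight_layout()
```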

**12. Box Plot of Scores by Graduation Status**<br>
Show the distribution of Math_Score and Science_Score based on whether the student graduated.
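
A possible sketch with matplotlib's `Axes.boxplot`, one panel per score column and one box per graduation status. It assumes Graduated has already been normalised to 'Yes' and 'No'. In the five remaining rows only one student has graduated, so that box collapses to a single line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Five remaining rows, with Graduated assumed to be normalised already
df = pd.DataFrame({
    'Graduated': ['No', 'No', 'Yes', 'No', 'No'],
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
})

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
for ax, col in zip(axes, ['Math_Score', 'Science_Score']):
    grouped = df.groupby('Graduated')[col]
    labels = [status for status, _ in grouped]
    ax.boxplot([values.to_numpy(dtype=float) for _, values in grouped])
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_xlabel('Graduated')
    ax.set_title(col)
axes[0].set_ylabel('Score')
fig.tight_layout()
```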

**13. KDE Plot of Scores**<br>
Create Kernel Density Estimation (KDE) plots for Math_Score and Science_Score.
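
The sketch below computes a Gaussian KDE by hand with NumPy so that the mechanics are visible; library routines such as `seaborn.kdeplot` or `Series.plot.kde()` are the usual shortcuts. The `gaussian_kde_curve` helper is a hypothetical name, and the bandwidth uses the normal reference rule of thumb (1.06 * std * n^(-1/5)), which is an assumption rather than a tuned choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.DataFrame({
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
})

def gaussian_kde_curve(values, grid):
    """Hypothetical helper: average of Gaussian bumps centred on each observation."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Normal reference rule of thumb for the bandwidth
    bandwidth = 1.06 * values.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

# Evaluate both densities on a grid wide enough to contain the tails
grid = np.linspace(40, 150, 500)
curves = {col: gaussian_kde_curve(scores[col], grid) for col in scores.columns}

fig, ax = plt.subplots(figsize=(7, 4))
for col, density in curves.items():
    ax.plot(grid, density, label=col)
ax.set_xlabel('Score')
ax.set_ylabel('Density')
ax.set_title('KDE of Math and Science Scores')
ax.legend()
fig.tight_layout()
```

Each curve should integrate to roughly 1 over the grid, which is a quick sanity check that the normalisation is right.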

**14. Histogram for Each Numerical Column**<br>
Create histograms for all numerical columns (Age, Math_Score, Science_Score).
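
A short sketch using `DataFrame.hist()`, which draws one histogram per numeric column. It assumes the -999 age has already been replaced with NaN, which `hist()` then skips for that column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Five remaining rows; the -999 age is assumed to be replaced by NaN
numeric = pd.DataFrame({
    'Age': [30, 35, np.nan, 28, 45],
    'Math_Score': [92, 100, 110, 95, 88],
    'Science_Score': [85, 95, 100, 105, 77],
})

# One subplot per column; missing values are dropped column by column
axes = numeric.hist(bins=5, figsize=(9, 6), edgecolor='black')
fig = axes.ravel()[0].get_figure()
fig.suptitle('Histograms of Numerical Columns')
```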