# Project data preprocessing
---



##### In this notebook, we will explore data preprocessing techniques using the student_habits_performance dataset from Kaggle.
##### We will go through the following steps:
##### 1. Explore the dataset by:
 * Viewing random samples of data.
 * Identifying the total number of rows and columns.
##### 2. Handle missing values by:
 * Calculating the percentage of missing data.
 * Deciding and implementing a method for handling missing values (e.g., filling or dropping).
##### 3. Identify and remove duplicate rows.
---

In [2]:
# Import necessary libraries
import pandas as pd

### Load the student_habits_performance dataset

In [3]:
df = pd.read_csv("student_habits_performance.csv")


### Viewing random samples of data


In [4]:
df.sample(10)

,student_id,age,gender,study_hours_per_day,social_media_hours,netflix_hours,part_time_job,attendance_percentage,sleep_hours,diet_quality,exercise_frequency,parental_education_level,internet_quality,mental_health_rating,extracurricular_participation,exam_score
488,S1488,17,Male,5.3,2.5,2.7,Yes,83.2,5.8,Good,3,Bachelor,Average,6,Yes,93.4
65,S1065,18,Female,3.3,4.1,0.7,No,80.5,5.7,Poor,0,Master,Good,8,No,69.6
955,S1955,19,Female,2.7,2.8,1.4,Yes,92.2,10.0,Poor,3,High School,Average,8,No,86.9
332,S1332,20,Male,2.7,1.9,2.3,No,71.6,7.7,Fair,3,High School,Poor,8,No,71.2
63,S1063,17,Male,4.1,2.3,2.6,Yes,76.5,5.1,Fair,3,Bachelor,Average,9,No,77.6
823,S1823,23,Male,4.8,2.8,0.6,No,93.5,6.7,Fair,4,High School,Good,1,No,85.4
591,S1591,24,Female,4.5,4.4,2.2,No,85.1,6.5,Fair,4,High School,Good,1,No,64.0
964,S1964,23,Other,3.5,2.3,2.1,No,84.3,7.6,Good,5,High School,Poor,10,Yes,89.1
220,S1220,19,Female,4.5,1.3,4.2,No,86.4,6.3,Fair,0,High School,Average,10,No,82.1
541,S1541,18,Female,4.2,3.8,3.4,Yes,84.8,5.4,Good,2,High School,Good,5,No,67.4


### Identifying the total number of rows and columns

In [5]:
df.shape

(1000, 16)

### Calculating the percentage of missing data

In [6]:
df.isnull().mean() * 100

student_id                       0.0
age                              0.0
gender                           0.0
study_hours_per_day              0.0
social_media_hours               0.0
netflix_hours                    0.0
part_time_job                    0.0
attendance_percentage            0.0
sleep_hours                      0.0
diet_quality                     0.0
exercise_frequency               0.0
parental_education_level         9.1
internet_quality                 0.0
mental_health_rating             0.0
extracurricular_participation    0.0
exam_score                       0.0
dtype: float64
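
The output above shows `parental_education_level` as the only column with missing values. The two handling options named in the overview (filling or dropping) could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual data, and the `"Unknown"` placeholder label is an assumption chosen only for the example.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4"],
    "parental_education_level": ["Bachelor", None, "High School", "Bachelor"],
})

# Percentage of missing values per column, same expression as the cell above.
missing_pct = toy.isnull().mean() * 100

# Option 1: fill the gaps with an explicit placeholder category
# ("Unknown" is a hypothetical label, not part of the original dataset).
filled = toy.fillna({"parental_education_level": "Unknown"})

# Option 2: drop the rows where this column is missing.
dropped = toy.dropna(subset=["parental_education_level"])

print(missing_pct)
print(filled)
print(len(dropped))
```

Filling keeps every row at the cost of an extra category, while dropping rows shrinks the dataset. Which is preferable depends on how the column is used later.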

### Identify and remove duplicate rows

In [7]:
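# Print the number of fully duplicated rows; a bare expression that is not
# the last line of a cell is not displayed
print(df.duplicated().sum())
# Keep the first occurrence of each row and drop the repeats
df = df.drop_duplicates()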
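
To make the effect of these two calls concrete, here is a small sketch on a hypothetical toy frame that contains one repeated row. The data is invented for illustration.

```python
import pandas as pd

# Toy frame with one exact duplicate (the second "S2" row).
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S2", "S3"],
    "exam_score": [70.0, 85.5, 85.5, 90.0],
})

# duplicated() marks rows that repeat an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```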

## Data Objects and Attribute Types





#### Identify and store the names of DataFrame columns that contain categorical values.

In [29]:
nominal = [col for col in df.columns if df[col].dtype == 'object']
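
As a possible variant, non-numeric columns can also be selected with `DataFrame.select_dtypes`, which avoids comparing against the `'object'` dtype string directly. The sketch below uses an invented toy frame and is an alternative to the cell above, not the notebook's original method.

```python
import pandas as pd

# Toy frame mixing text and numeric columns (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "diet_quality": ["Good", "Poor", "Fair"],
    "age": [17, 18, 19],
    "exam_score": [93.4, 69.6, 86.9],
})

# Select every column that is not numeric and collect the names.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(non_numeric)
```

Either approach also picks up identifier-like text columns such as `student_id`, so the resulting list may need manual pruning before it is treated as a list of categorical features.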


#### Identifying and storing the names of columns that have exactly two unique values.

In [None]:
binary = [col for col in df.columns if df[col].nunique() == 2]
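
One detail worth keeping in mind is that `nunique()` ignores missing values by default, so a column holding two labels plus some gaps still counts as binary. The following sketch, on a made-up toy frame, contrasts the default with `dropna=False`.

```python
import pandas as pd

# Toy frame (values are illustrative only).
toy = pd.DataFrame({
    "part_time_job": ["Yes", "No", "Yes", "No"],
    "gender": ["Male", "Female", "Other", "Male"],
    "extracurricular_participation": ["Yes", None, "No", "Yes"],
})

# Default behaviour: missing values are not counted as a distinct value.
binary = [col for col in toy.columns if toy[col].nunique() == 2]

# Stricter variant: count the missing value as a third distinct value.
binary_strict = [col for col in toy.columns if toy[col].nunique(dropna=False) == 2]

print(binary)
print(binary_strict)
```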

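#### Filter out rows with 'Other' in the gender column

In [15]:
# Reload the CSV (this discards in-memory changes such as drop_duplicates),
# keep only rows whose gender is not "Other", and overwrite the file.
# The column is named "gender" in the sample output above.
df = pd.read_csv("student_habits_performance.csv")
df = df[df["gender"] != "Other"]
df.to_csv("student_habits_performance.csv", index=False)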
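
The row filter itself is a boolean mask. The sample output earlier lists the column as `gender`, so this sketch uses that name. It runs on an invented toy frame and leaves out the file read and write.

```python
import pandas as pd

# Toy frame (values are illustrative only).
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4"],
    "gender": ["Male", "Other", "Female", "Male"],
})

# True for rows to keep, False for rows labelled "Other".
mask = toy["gender"] != "Other"

filtered = toy[mask].reset_index(drop=True)
removed = int((~mask).sum())  # number of rows filtered out

print(filtered)
print(removed)
```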

#### Remove the 'parental_education_level' column from the dataset

In [16]:
df = pd.read_csv("student_habits_performance.csv")

df = df.drop(columns=["parental_education_level"])

df.to_csv("student_habits_performance.csv", index=False)

#### Remove the 'internet_quality' column from the dataset

In [17]:
df = pd.read_csv("student_habits_performance.csv")

df = df.drop(columns=["internet_quality"])

df.to_csv("student_habits_performance.csv", index=False)

#### Remove the 'diet_quality' column from the dataset

In [18]:
df = pd.read_csv("student_habits_performance.csv")

df = df.drop(columns=["diet_quality"])

df.to_csv("student_habits_performance.csv", index=False)
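
The three removals above could also be done in a single `drop` call, which avoids re-reading and re-writing the CSV between steps. This is a sketch on a hypothetical toy frame. Only the column names come from the notebook.

```python
import pandas as pd

# Toy frame containing the three columns to remove (values are illustrative only).
toy = pd.DataFrame({
    "student_id": ["S1", "S2"],
    "parental_education_level": ["Bachelor", "Master"],
    "internet_quality": ["Good", "Poor"],
    "diet_quality": ["Fair", "Good"],
    "exam_score": [93.4, 69.6],
})

cols_to_drop = ["parental_education_level", "internet_quality", "diet_quality"]

# drop() accepts a list, so all three columns go in one step.
reduced = toy.drop(columns=cols_to_drop)

print(reduced.columns.tolist())
```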

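#### Convert 'gender' column values to numeric

In [25]:
# Encode gender as 0/1; rows with "Other" were filtered out earlier.
# The column is named "gender" in the sample output above.
df["gender"] = df["gender"].replace({"Female": 0, "Male": 1}).astype(int)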
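
A comparable encoding can be written with `Series.map`, which turns any label missing from the mapping into `NaN` and so makes unexpected values easy to detect before casting. The sketch below is an alternative under that assumption, shown on an invented series named `gender` to match the sample output earlier.

```python
import pandas as pd

# Toy series (values are illustrative only).
gender = pd.Series(["Female", "Male", "Male", "Female"], name="gender")

# Labels not present in the mapping would become NaN.
encoded = gender.map({"Female": 0, "Male": 1})

# Check that every label was covered before casting to int.
all_mapped = bool(encoded.notna().all())
encoded = encoded.astype(int)

print(all_mapped)
print(encoded.tolist())
```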

#### Save the updated dataset back to the CSV file


In [28]:
df.to_csv("student_habits_performance.csv", index=False)

#### Convert all 'Yes' values to 1 and 'No' values to 0 in the dataset

In [27]:
df = df.replace({"Yes": 1, "No": 0})
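
Replacing across the whole DataFrame touches every column that happens to contain "Yes" or "No". A narrower variant converts only named columns. In the sketch below, the two column names come from the dataset, while the toy values and the `yes_no_cols` list are assumptions for illustration.

```python
import pandas as pd

# Toy frame (values are illustrative only).
toy = pd.DataFrame({
    "part_time_job": ["Yes", "No", "No"],
    "extracurricular_participation": ["No", "Yes", "No"],
    "exam_score": [93.4, 69.6, 86.9],
})

# Hypothetical list of the columns expected to hold Yes/No values.
yes_no_cols = ["part_time_job", "extracurricular_participation"]

for col in yes_no_cols:
    # Map the two labels to 1/0 and store them as integers.
    toy[col] = toy[col].map({"Yes": 1, "No": 0}).astype(int)

print(toy)
```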

