# Data Cleaning

### Loading the Dataset
The dataset is read from a CSV file into a pandas DataFrame, and the first five rows are displayed as a quick sanity check.


In [4]:
import pandas as pd

df = pd.read_csv('/content/student_data.csv')
display(df.head())

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


### Importing Required Libraries
Pandas is imported to handle data loading, cleaning, and transformation. Its `DataFrame` and `Series` structures let each step be scripted and repeated, which is harder to do with manual tools such as Excel. NumPy is imported alongside it for the conditional logic used later (`np.where`).


### Initial Data Exploration
The structure of the dataset is checked with `.info()` and the first few rows are displayed with `.head()`. Together these show the column names, data types, and non-null counts, which reveal whether any values are missing.


In [5]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('student_data.csv')

# Display structure
print("--- Dataset Info ---")
df.info()
print("\n--- First 5 Rows ---")
df.head()



--- Dataset Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    o

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
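
Beyond `.info()` and `.head()`, it can be useful to split the columns by type and summarise each group separately. The following is a minimal sketch on a hypothetical toy frame (the values are invented for illustration and only borrow a few column names from the dataset); it assumes nothing about the real file.

```python
import pandas as pd

# Hypothetical toy data, not read from student_data.csv
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
    "G3": [6, 6, 10],
})

# Separate numeric columns from text columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Summary statistics for numeric columns, distinct-value counts for text columns
print(toy[numeric_cols].describe())
print(toy[text_cols].nunique())
```

On the real DataFrame the same calls would be applied to `df` instead of `toy`.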


### Identifying Missing Values
Missing values are identified using `.isnull().sum()`. This step is important because missing data can lead to incorrect analysis or biased results if not handled properly.


In [6]:
# Identify missing values
missing_data = df.isnull().sum()
print(missing_data)


school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64
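
Every count above is zero, so it may help to see what the same call reports when gaps exist. This is a small sketch on a hypothetical toy frame with deliberately missing entries; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, not taken from student_data.csv
toy = pd.DataFrame({
    "age": [18, np.nan, 15, 16],
    "Mjob": ["teacher", None, "other", None],
    "G3": [6, 6, 10, 15],
})

missing_counts = toy.isnull().sum()         # number of missing entries per column
missing_share = toy.isnull().mean() * 100   # same information as a percentage of rows

print(missing_counts)
print(missing_share)
```

Reporting the percentage alongside the raw count makes it easier to judge whether a column is worth filling or should be dropped.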


### Handling Missing Values
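The check above found no missing values in this dataset, so no imputation is actually needed here. When gaps do occur, numeric columns are typically filled with the mean or median and categorical columns with the mode, or the affected rows are dropped. The next code cell keeps a guarded mean fill for `age` that only runs if nulls are present.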
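
The strategies named here can be sketched side by side on a hypothetical toy frame. The values are invented, and the choice of which strategy suits which column is an assumption made for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "age": [15.0, 16.0, np.nan, 18.0, 22.0],
    "absences": [0.0, 2.0, 4.0, np.nan, 30.0],
    "Mjob": ["other", "other", None, "teacher", "health"],
})

filled = toy.copy()
# Mean: reasonable for roughly symmetric numeric data (15, 16, 18, 22 -> 17.75)
filled["age"] = filled["age"].fillna(filled["age"].mean())
# Median: less sensitive to outliers such as the 30 here (0, 2, 4, 30 -> 3.0)
filled["absences"] = filled["absences"].fillna(filled["absences"].median())
# Mode: the most frequent value, the usual choice for categorical columns
filled["Mjob"] = filled["Mjob"].fillna(filled["Mjob"].mode().iloc[0])

# Alternative: drop every row that contains at least one missing value
dropped = toy.dropna()

print(filled)
print(f"Rows kept after dropna: {len(dropped)} of {len(toy)}")
```

Filling keeps all rows at the cost of inventing values, while dropping keeps only observed values at the cost of sample size.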


### Removing Duplicate Records
Duplicate rows are removed using `.drop_duplicates()` to avoid repeated information. This helps maintain data accuracy and prevents inflated results during analysis.


In [7]:
# Remove duplicates
initial_count = len(df)
df = df.drop_duplicates()
print(f"Removed {initial_count - len(df)} duplicate rows.")

# Fill missing numeric values with mean (e.g., age)
if df['age'].isnull().any():
    df['age'] = df['age'].fillna(df['age'].mean())


Removed 0 duplicate rows.
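
Since no duplicates were found here, the following sketch shows the behaviour on a hypothetical toy frame that does contain repeats. It also shows the `subset` and `keep` parameters of `drop_duplicates()`; treating `school` and `age` as a key is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: row 1 is an exact copy of row 0
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS"],
    "age": [18, 18, 17, 18],
    "G3": [6, 6, 10, 6],
})

# Count exact duplicate rows before removing them
n_dupes = int(toy.duplicated().sum())

# Default: compare all columns and keep the first occurrence
deduped = toy.drop_duplicates()

# Compare only selected columns and keep the last occurrence instead
by_key = toy.drop_duplicates(subset=["school", "age"], keep="last")

print(f"Exact duplicates: {n_dupes}")
print(deduped)
print(by_key)
```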


### Creating New Features
A new column, `age_category`, is derived from `age` with a logical condition: students aged 18 or older are labelled `Adult` and all others `Minor`. Derived columns like this make it easier to group and compare records during analysis.


In [8]:
# Create new column using logic
df['age_category'] = np.where(df['age'] >= 18, 'Adult', 'Minor')
df[['age', 'age_category']].head(10)


Unnamed: 0,age,age_category
0,18,Adult
1,17,Minor
2,15,Minor
3,15,Minor
4,16,Minor
5,16,Minor
6,16,Minor
7,17,Minor
8,15,Minor
9,15,Minor
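
`np.where` handles a two-way split. For more than two groups, `pd.cut` is one option. The sketch below uses a hypothetical toy frame, and the band edges and labels are assumptions chosen for illustration rather than categories used in this notebook.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"age": [15, 16, 17, 18, 22]})

# Bins are right-inclusive by default: (0, 15], (15, 17], (17, 120]
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[0, 15, 17, 120],
    labels=["15 or under", "16-17", "18+"],
)

print(toy)
```

The resulting column is categorical, so it keeps the band order when sorting or grouping.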


### Saving the Cleaned Dataset
The cleaned dataset is saved as a new CSV file using `.to_csv()`. This ensures the processed data can be reused for reporting, visualization, or further analysis.


In [9]:
# Save to CSV
df.to_csv('cleaned_data.csv', index=False)
print("File 'cleaned_data.csv' has been created and is ready for download.")


File 'cleaned_data.csv' has been created and is ready for download.
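
A quick way to confirm the export worked is to read the file back and compare it with the frame in memory. The sketch below does this with a hypothetical toy frame written to a temporary directory, so it does not touch the real `cleaned_data.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the cleaned DataFrame
toy = pd.DataFrame({"age": [18, 17], "age_category": ["Adult", "Minor"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_data.csv"
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the CSV is read back
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```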
