**Task 1 focused on understanding and preparing the dataset for analysis through data immersion and wrangling. The dataset was explored in detail and a data dictionary was created to describe each column. A data quality assessment checked for missing values, duplicate rows, incorrect formats, and outliers. Cleaning steps covered handling missing data, removing duplicates, and correcting data types. A new feature, Age, was derived from Date of Birth to enrich the dataset. The result is a clean, analysis-ready dataset.**

**Import Libraries**

In [5]:
import pandas as pd
from datetime import datetime

**Load the Dataset**

In [6]:
df = pd.read_csv("student_dataset.csv")


**View Dataset - First 5 rows**

In [7]:
print(df.head())     # First 5 rows


  Student_ID           Name Date_of_Birth Gender  Course  Year           City  \
0      S0001  Vihaan Sharma    2005-01-31      M    B.Sc     2  Visakhapatnam   
1      S0002   Saanvi Reddy    2006-09-25      M  B.Tech     1        Chennai   
2      S0003  Krishna Gupta    2002-04-19      M      BA     4        Chennai   
3      S0004     Myra Mehta    2005-02-13      M    B.Sc     4        Kolkata   
4      S0005   Ishaan Verma    2004-05-31      F  B.Tech     1          Delhi   

                        Email  
0  vihaan.sharma1@example.com  
1   saanvi.reddy2@example.com  
2  krishna.gupta3@example.com  
3     myra.mehta4@example.com  
4   ishaan.verma5@example.com  


**View Dataset - Rows and Columns**

In [8]:
print(df.shape)      


(1000, 8)


**View Dataset - Column names**

In [9]:
print(df.columns)  

Index(['Student_ID', 'Name', 'Date_of_Birth', 'Gender', 'Course', 'Year',
       'City', 'Email'],
      dtype='object')


**Check the Data Types**

In [10]:
df.info()
df.describe(include='all')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Student_ID     1000 non-null   object
 1   Name           1000 non-null   object
 2   Date_of_Birth  1000 non-null   object
 3   Gender         1000 non-null   object
 4   Course         1000 non-null   object
 5   Year           1000 non-null   int64 
 6   City           1000 non-null   object
 7   Email          1000 non-null   object
dtypes: int64(1), object(7)
memory usage: 62.6+ KB


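| | Student_ID | Name | Date_of_Birth | Gender | Course | Year | City | Email |
|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000.0 | 1000 | 1000 |
| unique | 1000 | 199 | 829 | 2 | 5 | | 10 | 1000 |
| top | S0001 | Sara Iyer | 2006-11-17 | F | BBA | | Bengaluru | vihaan.sharma1@example.com |
| freq | 1 | 13 | 3 | 514 | 220 | | 111 | 1 |
| mean | | | | | | 2.481 | | |
| std | | | | | | 1.108543 | | |
| min | | | | | | 1.0 | | |
| 25% | | | | | | 2.0 | | |
| 50% | | | | | | 2.0 | | |
| 75% | | | | | | 3.0 | | |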
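
The summary mentions a data dictionary, but the notebook does not show how it was built. The following is a minimal sketch of one way to generate it from a DataFrame. The `sample` frame, the `descriptions` text, and the helper `build_data_dictionary` are all hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the student dataset (values are illustrative only).
sample = pd.DataFrame({
    "Student_ID": ["S0001", "S0002", "S0003"],
    "Gender": ["M", "F", "F"],
    "Year": [2, 1, 4],
})

# Hypothetical, hand-written column descriptions.
descriptions = {
    "Student_ID": "Unique student identifier",
    "Gender": "Gender code (M or F)",
    "Year": "Year of study",
}


def build_data_dictionary(frame, descriptions):
    """Summarise each column: dtype, non-null count, unique count, description."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "unique": frame.nunique(),
        "description": pd.Series(descriptions),
    })


data_dictionary = build_data_dictionary(sample, descriptions)
print(data_dictionary)
```

Applied to the real data, `build_data_dictionary(df, descriptions)` would produce one row per column, which could then be saved alongside the cleaned file.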


**Check for Missing Values**

In [11]:
print(df.isnull().sum())


Student_ID       0
Name             0
Date_of_Birth    0
Gender           0
Course           0
Year             0
City             0
Email            0
dtype: int64


**Check for Duplicate Rows**

In [12]:
print(df.duplicated().sum())


0
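
`df.duplicated()` only flags rows that are identical in every column. Records that share an identifier but differ elsewhere would slip through, so it can be worth checking key columns separately. The sketch below uses a toy frame with made-up values; treating `Student_ID` as the key is an assumption based on the column name and its 1000 unique values in the summary statistics.

```python
import pandas as pd

# Toy data (illustrative only): two rows share a Student_ID but are not identical.
sample = pd.DataFrame({
    "Student_ID": ["S0001", "S0002", "S0002"],
    "Name": ["Sara Iyer", "Sara Iyer", "Myra Mehta"],
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
})

# Rows identical in every column to an earlier row.
full_row_duplicates = int(sample.duplicated().sum())

# Rows whose Student_ID already appeared in an earlier row.
id_duplicates = int(sample.duplicated(subset=["Student_ID"]).sum())

# keep=False marks every member of a duplicate group, which helps inspection.
conflicting_rows = sample[sample.duplicated(subset=["Student_ID"], keep=False)]

print(full_row_duplicates, id_duplicates)
print(conflicting_rows)
```

Repeated names are not treated as a problem here: the summary statistics show only 199 unique names across 1000 students, which is plausible for real people sharing a name.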


**Check the Unique Values**

In [13]:
print(df["Gender"].unique())
print(df["Course"].unique())


['M' 'F']
['B.Sc' 'B.Tech' 'BA' 'B.Com' 'BBA']
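
Reading the `unique()` output by eye works for two small columns. A programmatic check against an allowed set scales better and can be rerun when the data changes. This is a sketch on toy data: the allowed sets are taken from the output above, while the 1 to 4 range for `Year` and the helper `unexpected_values` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data (illustrative only) with a few deliberately invalid entries.
sample = pd.DataFrame({
    "Gender": ["M", "F", "f", "Male"],
    "Course": ["B.Sc", "BBA", "B.Tech", "BCA"],
    "Year": [1, 4, 2, 7],
})

# Allowed categories, copied from the unique() output of the real dataset.
ALLOWED = {
    "Gender": {"M", "F"},
    "Course": {"B.Sc", "B.Tech", "BA", "B.Com", "BBA"},
}


def unexpected_values(frame, allowed):
    """Return, per column, the sorted values that are not in the allowed set."""
    return {
        column: sorted(set(frame[column].dropna()) - values)
        for column, values in allowed.items()
    }


bad_categories = unexpected_values(sample, ALLOWED)

# Assumed valid range for Year of study: 1 to 4 inclusive.
bad_years = sample.loc[~sample["Year"].between(1, 4), "Year"].tolist()

print(bad_categories)
print(bad_years)
```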


**Remove Duplicate Rows and Handle Missing Values**

In [14]:
df = df.drop_duplicates()
df = df.dropna()

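**Fill Missing City Values with "Unknown"**

In [15]:
# Assign back instead of using inplace=True on a column selection, which may not update df in recent pandas.
df["City"] = df["City"].fillna("Unknown")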
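
Because `dropna()` runs before this fill, any row with a missing `City` has already been removed, so the fill has nothing left to do. On this dataset it makes no difference, since no values are missing. On data that does have gaps, the order matters. The sketch below compares both orders on a toy frame; which columns count as required (`Student_ID`) and which as optional (`City`) is an assumption made for illustration.

```python
import pandas as pd

# Toy data (illustrative only): one row lacks a City, one lacks a Student_ID.
sample = pd.DataFrame({
    "Student_ID": ["S0001", "S0002", None],
    "City": ["Chennai", None, "Delhi"],
})

# Order A: drop first, then fill. The row with a missing City is already gone.
dropped_first = sample.dropna().copy()
dropped_first["City"] = dropped_first["City"].fillna("Unknown")

# Order B: fill the optional column first, then drop rows missing a required field.
filled_first = sample.copy()
filled_first["City"] = filled_first["City"].fillna("Unknown")
filled_first = filled_first.dropna(subset=["Student_ID"])

print(len(dropped_first), len(filled_first))
print(filled_first)
```

Order B keeps the student whose city is unknown, which is usually the intent behind filling with a placeholder.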


**Convert Date_of_Birth to Datetime Format**

In [16]:
df["Date_of_Birth"] = pd.to_datetime(df["Date_of_Birth"])
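
By default `pd.to_datetime` raises an error on the first value it cannot parse. A more defensive variant, sketched here on toy strings, turns unparseable entries into `NaT` so they can be listed, and also flags dates of birth that lie in the future. The `%Y-%m-%d` format is assumed from the first rows shown earlier; the sample values are made up.

```python
import pandas as pd

# Toy input (illustrative only): two valid dates, one bad string, one future date.
raw = pd.Series(["2005-01-31", "2006-09-25", "not a date", "2200-01-01"])

# errors="coerce" converts anything that does not match the format to NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Original strings that failed to parse (and were not missing to begin with).
unparseable = raw[parsed.isna() & raw.notna()]

# Parsed dates later than the current date cannot be valid dates of birth.
in_future = parsed[parsed > pd.Timestamp("today")]

print(unparseable.tolist())
print(in_future.tolist())
```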


**Standardize Text Values**

In [17]:
df["Gender"] = df["Gender"].str.upper()
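
Upper-casing alone does not remove stray whitespace, so `" m"` and `"M"` would still count as different values. The real column already contains only `M` and `F`, so this is a precaution rather than a fix. A small sketch on made-up values:

```python
import pandas as pd

# Toy input (illustrative only) with mixed case and stray spaces.
raw = pd.Series([" m", "F ", "f", "M"])

# Upper-casing only: the padded values remain distinct.
upper_only = raw.str.upper()

# Strip whitespace first, then upper-case.
clean = raw.str.strip().str.upper()

print(upper_only.nunique(), clean.nunique())
```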


**Feature Engineering: Create Age Column from Date of Birth**

In [18]:
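# Reference date for the age calculation (the date the notebook is run).
today = pd.to_datetime("today")

# Approximate age in whole years; dividing by 365 ignores leap days.
df["Age"] = (today - df["Date_of_Birth"]).dt.days // 365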
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Student_ID     1000 non-null   object        
 1   Name           1000 non-null   object        
 2   Date_of_Birth  1000 non-null   datetime64[ns]
 3   Gender         1000 non-null   object        
 4   Course         1000 non-null   object        
 5   Year           1000 non-null   int64         
 6   City           1000 non-null   object        
 7   Email          1000 non-null   object        
 8   Age            1000 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(6)
memory usage: 70.4+ KB
None
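
Dividing the day count by 365 ignores leap days, so the result can be one year too high shortly before a birthday. An alternative, sketched below, compares calendar fields directly. The helper `age_in_years`, the fixed reference date, and the sample dates are hypothetical and chosen to make the difference visible.

```python
import pandas as pd


def age_in_years(dob, on):
    """Completed years between each date of birth and the reference date."""
    had_birthday = (dob.dt.month < on.month) | (
        (dob.dt.month == on.month) & (dob.dt.day <= on.day)
    )
    # Subtract one year for anyone whose birthday has not yet occurred this year.
    return on.year - dob.dt.year - (~had_birthday).astype(int)


# Toy dates (illustrative only); the second birthday falls one day after the reference.
dob = pd.Series(pd.to_datetime(["2005-01-31", "2004-05-31", "2004-02-29"]))
reference = pd.Timestamp("2024-05-30")

exact_age = age_in_years(dob, reference)
approx_age = (reference - dob).dt.days // 365

print(exact_age.tolist())   # [19, 19, 20]
print(approx_age.tolist())  # [19, 20, 20]
```

The second student is still 19 on the reference date, but the 365-day approximation already reports 20. Using a fixed reference date rather than "today" also keeps the result reproducible between runs.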


**Save the Cleaned Dataset**

In [19]:
df.to_csv("cleaned_student_dataset.csv", index=False)
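
CSV files do not store column types, so `Date_of_Birth` comes back as plain text when the cleaned file is read again unless it is parsed explicitly. The sketch below demonstrates this with an in-memory buffer instead of a file on disk; the two-row frame is made up for illustration.

```python
import io

import pandas as pd

# Toy cleaned frame (illustrative only) with a real datetime column.
cleaned = pd.DataFrame({
    "Student_ID": ["S0001", "S0002"],
    "Date_of_Birth": pd.to_datetime(["2005-01-31", "2006-09-25"]),
})

# Write to an in-memory buffer, standing in for cleaned_student_dataset.csv.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Plain reload: the date column is read back as text.
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# Reload with parse_dates: the column is a datetime again.
buffer.seek(0)
reloaded_dates = pd.read_csv(buffer, parse_dates=["Date_of_Birth"])

print(reloaded_plain.dtypes)
print(reloaded_dates.dtypes)
```

For the real file this would be `pd.read_csv("cleaned_student_dataset.csv", parse_dates=["Date_of_Birth"])`.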
