<a href="https://colab.research.google.com/github/tanvirathore36-DS/FUTURE_DS_03/blob/main/notebooks/01_data_cleaning_and_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 01_Data_Cleaning_and_EDA

This notebook performs data loading, cleaning, and exploratory data analysis (EDA) for the **College Event Feedback** dataset from Task 3 (Future Interns – Data Science & Analytics Internship).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# make charts display nicely
plt.style.use("ggplot")
sns.set_palette("pastel")


In [4]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [6]:
# Load dataset
df = pd.read_csv("/content/drive/MyDrive/student_feedback.csv")
df.head()



,Unnamed: 0,Student ID,Well versed with the subject,Explains concepts in an understandable way,Use of presentations,Degree of difficulty of assignments,Solves doubts willingly,Structuring of the course,Provides support for students going above and beyond,Course recommendation based on relevance
0,0,340,5,2,7,6,9,2,1,8
1,1,253,6,5,8,6,2,1,2,9
2,2,680,7,7,6,5,4,2,3,1
3,3,806,9,6,7,1,5,9,4,6
4,4,632,8,10,8,4,6,6,9,9
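
The `Unnamed: 0` column in the preview is most likely a row index that was written into the CSV when the file was saved earlier. A minimal sketch, assuming that is the case, of two ways to keep it out of the analysis. The inline three-row sample is a stand-in for `student_feedback.csv` and is only there so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample standing in for student_feedback.csv.
# The empty first header cell is what pandas later names "Unnamed: 0".
sample_csv = io.StringIO(
    ",Student ID,Well versed with the subject\n"
    "0,340,5\n"
    "1,253,6\n"
    "2,680,7\n"
)

# Option 1: treat the first (unnamed) column as the index while loading.
df_indexed = pd.read_csv(sample_csv, index_col=0)

# Option 2: load everything, then drop columns whose name starts with "Unnamed".
sample_csv.seek(0)
df_raw = pd.read_csv(sample_csv)
df_dropped = df_raw.loc[:, ~df_raw.columns.str.startswith("Unnamed")]

print(df_indexed.columns.tolist())
print(df_dropped.columns.tolist())
```

Either option leaves only the real survey columns. Option 1 is shorter, while Option 2 also catches stray index columns that are not in the first position.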


In [7]:
df.info()
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 10 columns):
 #   Column                                                Non-Null Count  Dtype
---  ------                                                --------------  -----
 0   Unnamed: 0                                            1001 non-null   int64
 1   Student ID                                            1001 non-null   int64
 2   Well versed with the subject                          1001 non-null   int64
 3   Explains concepts in an understandable way            1001 non-null   int64
 4   Use of presentations                                  1001 non-null   int64
 5   Degree of difficulty of assignments                   1001 non-null   int64
 6   Solves doubts willingly                               1001 non-null   int64
 7   Structuring of the course                             1001 non-null   int64
 8   Provides support for students going above and beyond  1001 non-null   int64
 9   Course recommendation based on relevance               1001 non-null   int64

Unnamed: 0,0
Student ID,0
Well versed with the subject,0
Explains concepts in an understandable way,0
Use of presentations,0
Degree of difficulty of assignments,0
Solves doubts willingly,0
Structuring of the course,0
Provides support for students going above and beyond,0
Course recommendation based on relevance,0
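
Null counts alone do not catch values that are present but invalid. A small sketch of a range check for the rating columns, assuming a 1-to-10 scale (the source does not state the scale, so `low` and `high` are assumptions) and using made-up rows. The helper name `out_of_range_counts` is hypothetical.

```python
import pandas as pd


def out_of_range_counts(df, rating_cols, low=1, high=10):
    """Count values outside [low, high] in each rating column.

    Missing values compare as False on both sides, so they are not counted here.
    """
    ratings = df[rating_cols]
    outside = (ratings < low) | (ratings > high)
    return outside.sum()


# Made-up example rows; 11 and 0 are deliberately invalid.
demo = pd.DataFrame(
    {
        "Student ID": [340, 253, 680],
        "Use of presentations": [7, 8, 11],
        "Solves doubts willingly": [9, 0, 4],
    }
)

# Treat every column except the identifier as a rating (an assumption).
rating_cols = [c for c in demo.columns if c != "Student ID"]

print(out_of_range_counts(demo, rating_cols))
print("duplicate Student IDs:", demo["Student ID"].duplicated().sum())
```

The duplicate check on `Student ID` complements `drop_duplicates()`, which only removes rows that match in every column.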


In [8]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Fill missing numeric values with the column median.
# Assigning the result back avoids chained in-place assignment,
# which pandas warns will stop working in pandas 3.0.
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].median())

# Fill missing categorical values with the column mode
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

df.head()


,Unnamed: 0,Student ID,Well versed with the subject,Explains concepts in an understandable way,Use of presentations,Degree of difficulty of assignments,Solves doubts willingly,Structuring of the course,Provides support for students going above and beyond,Course recommendation based on relevance
0,0,340,5,2,7,6,9,2,1,8
1,1,253,6,5,8,6,2,1,2,9
2,2,680,7,7,6,5,4,2,3,1
3,3,806,9,6,7,1,5,9,4,6
4,4,632,8,10,8,4,6,6,9,9


In [9]:
df.describe(include='all').T


,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,1001.0,500.0,289.108111,0.0,250.0,500.0,750.0,1000.0
Student ID,1001.0,500.0,289.108111,0.0,250.0,500.0,750.0,1000.0
Well versed with the subject,1001.0,7.497502,1.692998,5.0,6.0,8.0,9.0,10.0
Explains concepts in an understandable way,1001.0,6.081918,2.597168,2.0,4.0,6.0,8.0,10.0
Use of presentations,1001.0,5.942058,1.415853,4.0,5.0,6.0,7.0,8.0
Degree of difficulty of assignments,1001.0,5.430569,2.869046,1.0,3.0,5.0,8.0,10.0
Solves doubts willingly,1001.0,5.474525,2.874648,1.0,3.0,6.0,8.0,10.0
Structuring of the course,1001.0,5.636364,2.920212,1.0,3.0,6.0,8.0,10.0
Provides support for students going above and beyond,1001.0,5.662338,2.89169,1.0,3.0,6.0,8.0,10.0
Course recommendation based on relevance,1001.0,5.598402,2.886617,1.0,3.0,6.0,8.0,10.0
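
One way to read a summary like this is to rank the questions by mean score. A minimal sketch of that step, using only the five preview rows printed by `df.head()` so it runs without the CSV. The resulting numbers describe those five rows, not the full dataset, and treating every non-identifier column as a rating question is an assumption.

```python
import pandas as pd

# The five preview rows shown by df.head() earlier in this notebook.
preview = pd.DataFrame(
    {
        "Student ID": [340, 253, 680, 806, 632],
        "Well versed with the subject": [5, 6, 7, 9, 8],
        "Explains concepts in an understandable way": [2, 5, 7, 6, 10],
        "Use of presentations": [7, 8, 6, 7, 8],
        "Degree of difficulty of assignments": [6, 6, 5, 1, 4],
        "Solves doubts willingly": [9, 2, 4, 5, 6],
        "Structuring of the course": [2, 1, 2, 9, 6],
        "Provides support for students going above and beyond": [1, 2, 3, 4, 9],
        "Course recommendation based on relevance": [8, 9, 1, 6, 9],
    }
)

# Everything except the identifier is treated as a rating question.
rating_cols = preview.columns.drop("Student ID")

# Mean score per question, highest first.
ranking = preview[rating_cols].mean().sort_values(ascending=False)
print(ranking)
```

On the loaded data the same two lines would apply to `df` instead of `preview`, after excluding `Unnamed: 0` and `Student ID`.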


In [11]:
if 'Rating_Overall' in df.columns:
    plt.figure(figsize=(6,4))
    sns.histplot(df['Rating_Overall'], bins=5, kde=True)
    plt.title("Distribution of Overall Ratings")
    plt.xlabel("Rating")
    plt.ylabel("Count")
    plt.show()


In [12]:
# Both the grouping column and the rating column must exist before plotting;
# otherwise the cell is skipped instead of raising a KeyError.
if {'Event_Name', 'Rating_Overall'}.issubset(df.columns):
    plt.figure(figsize=(8,4))
    sns.barplot(x='Event_Name', y='Rating_Overall', data=df, estimator='mean')
    plt.title('Average Rating by Event')
    plt.xticks(rotation=25)
    plt.show()

if {'Department', 'Rating_Overall'}.issubset(df.columns):
    plt.figure(figsize=(8,4))
    sns.barplot(x='Department', y='Rating_Overall', data=df, estimator='mean')
    plt.title('Average Rating by Department')
    plt.xticks(rotation=25)
    plt.show()
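
None of the columns these plotting cells look for (`Rating_Overall`, `Event_Name`, `Department`) appear in the `df.info()` listing, so the cells draw nothing for this dataset. A sketch of one chart that does fit the available columns: the mean score per feedback question. The values are copied from the `describe()` output so the snippet runs on its own; the chart title and axis label are illustrative choices, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Mean score per question, copied from the describe() output in this notebook.
question_means = pd.Series(
    {
        "Well versed with the subject": 7.497502,
        "Explains concepts in an understandable way": 6.081918,
        "Use of presentations": 5.942058,
        "Degree of difficulty of assignments": 5.430569,
        "Solves doubts willingly": 5.474525,
        "Structuring of the course": 5.636364,
        "Provides support for students going above and beyond": 5.662338,
        "Course recommendation based on relevance": 5.598402,
    }
).sort_values()  # ascending, so the highest mean ends up at the top of the chart

# Horizontal bars keep the long question texts readable.
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(question_means.index, question_means.values)
ax.set_xlabel("Mean score")
ax.set_title("Average Score by Feedback Question")
fig.tight_layout()
plt.show()
```

In the live notebook the hard-coded series could be replaced by the column means of `df`, computed over the rating columns only.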


In [13]:
df.to_csv("student_feedback_cleaned.csv", index=False)
print("✅ Cleaned data saved as student_feedback_cleaned.csv")


✅ Cleaned data saved as student_feedback_cleaned.csv
