##Student Performance Indicator
####Lifecycle of Machine Learning Project

*   Understanding the Problem Statement
*   Data Collection
*   Data Checks to perform
*   EDA
*   Data Pre Processing
*   Model Training
*   Choose Best Model



### 1) Problem Statement
*   This project examines how student performance is affected by other variables such as Gender, Ethnicity, Parental Education Level, and Test Preparation Course.

### 2) Data Collection
*   Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

*   The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and required Packages

Importing Pandas, Numpy, Matplotlib, Seaborn and Warnings Library.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Import CSV Data as Pandas DataFrame



In [None]:
df = pd.read_csv('/content/stud.csv')

Checking Records



In [None]:
df.head()

In [None]:
df.columns

####Dataset Information



*   gender : sex of the student -> (male/female)
*   race/ethnicity : ethnicity of the student -> (group A, B, C, D, E)
*   parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
*   lunch : lunch before the test (standard or free/reduced)
*   test preparation course : completed or not completed before the test
*   math score
*   reading score
*   writing score


### 3) Data Checks to perform
*  Check Missing values
*  Check Duplicates
* Check data type
* Check the number of unique values of each column
* Check statistics of data set
* Check various categories present in the different categorical column
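
The checks listed here can be bundled into one helper, sketched below. The function name `run_data_checks` and the toy rows are assumptions made for illustration; they are not part of the original notebook or of `stud.csv`.

```python
import pandas as pd

def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks in one dictionary."""
    return {
        "missing": frame.isnull().sum().to_dict(),       # missing values per column
        "duplicates": int(frame.duplicated().sum()),     # fully duplicated rows
        "dtypes": frame.dtypes.astype(str).to_dict(),    # data type per column
        "unique": frame.nunique().to_dict(),             # distinct non-null values per column
    }

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [72, 47, 47, None],
})

checks = run_data_checks(toy)
print(checks["missing"])
print(checks["duplicates"])
print(checks["unique"])
```

On the real data the same call would be `run_data_checks(df)`.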

####3.1 Check Missing Values

In [None]:
df.isnull().sum()

There are no missing values in the dataset.


####3.2 Check Duplicates

In [None]:
df.duplicated().sum()

There are no duplicate values in the dataset

####3.3 Check data types

In [None]:
df.info()

In [None]:
df.nunique()

####3.4 Checking statistics of dataset

In [None]:
df.describe()

####Insights
* All means are close to each other; between 66 and 69
* All standard deviations are also close; between 14.6 and 15.19
* Minimum score for:
    * Math: 0
    * Reading: 17
    * Writing: 18

####3.7 Exploring Data

In [None]:
df.head()

In [None]:
print("Categories in 'gender' variable: ", end = '')
print(df['gender'].unique())

print("Categories in 'race_ethnicity' variable: ", end = '')
print(df['race_ethnicity'].unique())

print("Categories in 'lunch' variable: ", end = '')
print(df['lunch'].unique())

print("Categories in 'test_preparation_course' variable: ", end = '')
print(df['test_preparation_course'].unique())

print("Categories in 'parental_level_of_education' variable: ", end = '')
print(df['parental_level_of_education'].unique())

In [None]:
# Defining numerical and categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

#printing columns
print('We have {} numerical features :{}'.format(len(numeric_features), numeric_features))
print('\n We have {} categorical features: {}'.format(len(categorical_features), categorical_features))
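
As an alternative to comparing each column's dtype with `'O'`, pandas offers `DataFrame.select_dtypes`, which splits columns by kind and does not depend on how string columns happen to be stored. The toy frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [72, 47],
    "reading_score": [72, 57],
})

# Numeric columns versus everything else
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```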

###3.8 Adding columns for "Total Score" and "Average"

In [None]:
df['total_score'] = df['math_score'] + df['writing_score'] + df['reading_score']
df['average'] = round(df['total_score'] / 3, 2)
df.head()

In [None]:
reading_full = df[df['reading_score'] == 100]['average'].count()
writing_full = df[df['writing_score'] == 100]['average'].count()
math_full = df[df['math_score'] == 100]['average'].count()

print(f'Number of students with full marks in Math: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

In [None]:
reading_less = df[df['reading_score'] < 20]['average'].count()
writing_less = df[df['writing_score'] < 20]['average'].count()
math_less = df[df['math_score'] < 20]['average'].count()

print(f'Number of students with marks less than 20 in Math: {math_less}')
print(f'Number of students with marks less than 20 in Writing: {writing_less}')
print(f'Number of students with marks less than 20 in Reading: {reading_less}')


####Insights:
* Students performed worst in Math
* The best performance is in the Reading section
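
The per-subject counts can also be computed for all three score columns at once. This is a minimal sketch on hypothetical toy rows; the variable names `full_marks` and `below_20` are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "math_score": [100, 0, 19],
    "reading_score": [100, 17, 60],
    "writing_score": [99, 18, 100],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# A boolean frame sums to one count per column
full_marks = (toy[score_cols] == 100).sum()
below_20 = (toy[score_cols] < 20).sum()

print(full_marks)
print(below_20)
```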


###4. Exploring Data (Visualization)
#### 4.1 Visualize the total and average score distributions to draw some conclusions.
* Histogram
* Kernel Density Estimate (KDE)

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (15, 7))

# Left: overall distribution of total_score with a KDE overlay
sns.histplot(data = df, x = 'total_score', bins = 30, kde = True, color = 'g', ax = axs[0])

# Right: the same distribution split by gender
sns.histplot(data = df, x = 'total_score', kde = True, hue = 'gender', ax = axs[1])

plt.show()

Female students tend to perform better

In [None]:
plt.subplots(1, 3, figsize = (25, 6))
plt.subplot(131)

sns.histplot(data = df, x = 'average', kde = True, hue = 'lunch')
plt.subplot(132)

sns.histplot(data = df[df.gender == 'female'], x = 'average', kde = True, hue = 'lunch')
plt.subplot(133)

sns.histplot(data = df[df.gender == 'male'], x = 'average', kde = True, hue = 'lunch')
plt.show()

* Students with standard lunch tend to score higher in exams, and this holds for both male and female students

In [None]:
plt.subplots(1, 3, figsize = (25, 6))
plt.subplot(131)

sns.histplot(data = df, x = 'average', kde = True, hue = 'race_ethnicity')
plt.subplot(132)

sns.histplot(data = df[df.gender == 'female'], x = 'average', kde = True, hue = 'race_ethnicity')
plt.subplot(133)

sns.histplot(data = df[df.gender == 'male'], x = 'average', kde = True, hue = 'race_ethnicity')
plt.show()


Group A and B tend to perform poorly irrespective of gender

####4.2 Maximum score of students in all 3 subjects

In [None]:
plt.figure(figsize = (18, 8))

plt.subplot(1, 3, 1)
plt.title('Math Scores')
sns.violinplot(y='math_score', data = df, color = 'red', linewidth = 3)
plt.yticks(range(0, 101, 10))  # include the tick at 100

plt.subplot(1, 3, 2)
plt.title('Reading Scores')
sns.violinplot(y = 'reading_score', data = df, color = 'green', linewidth = 3)
plt.yticks(range(0, 101, 10))  # include the tick at 100

plt.subplot(1, 3, 3)
plt.title('Writing Scores')
sns.violinplot(y = 'writing_score', data = df, color = 'blue', linewidth= 3)
plt.yticks(range(0, 101, 10))  # include the tick at 100

plt.show()

####Insights
* From the three plots above, it is clearly visible that most students score:
    * Math: between 60 and 70
    * Reading: between 70 and 80
    * Writing: between 65 and 80

####4.3 Multivariate analysis using pieplot

In [None]:
plt.rcParams['figure.figsize'] = (30, 12)

plt.subplot(151)
count = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['Red', 'Orange']

plt.pie(count, colors = color, labels = labels, autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')

plt.subplot(152)
size = df['race_ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan', 'orange']

plt.pie(size, colors = color, labels = labels, autopct = '%.2f%%')
plt.title('Race/Ethnicity', fontsize = 20)
plt.axis('off')

plt.subplot(153)
sizes = df['lunch'].value_counts()
labels = sizes.index
color = ['red', 'green']

plt.pie(sizes, colors = color, labels = labels, autopct = "%.2f%%")
plt.title('Lunch', fontsize = 20)
plt.axis('off')


plt.subplot(154)
size = df['parental_level_of_education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan', 'yellow', 'orange']

plt.pie(size, colors = color, labels = labels, autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 20)
plt.axis('off')

plt.subplot(155)
counts = df['test_preparation_course'].value_counts()
label = counts.index
colors = ['red', 'yellow']

plt.pie(counts, colors = colors, labels = label, autopct = '%.2f%%')
plt.title('Test Course', fontsize = 20)
plt.axis('off')

plt.tight_layout()


plt.show()
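
When `autopct` is a string, matplotlib applies it to each slice percentage with printf-style formatting, so the position of the dot in the format matters. The snippet below illustrates this with plain Python string formatting; the value `31.9` is a hypothetical slice percentage.

```python
# printf-style formatting, the same mechanism applied to an autopct string
share = 31.9  # hypothetical slice percentage

two_decimals = '%.2f%%' % share   # precision of 2, then a literal percent sign
misplaced_dot = '.%2f%%' % share  # literal dot, then minimum width 2 with default precision

print(two_decimals)
print(misplaced_dot)
```

The first format yields `31.90%`, while the second yields `.31.900000%`.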

####Insights
* The number of male and female students is almost equal
* The number of students is greatest in Group C
* More students have standard lunch than free/reduced lunch
* More students have not enrolled in any test preparation course than have completed one
* "Some College" is the most common parental education level, followed closely by "Associate's Degree"

#### 4.4 Feature Wise Visualization


#### 4.4.1 Gender Column
* How is gender distributed?
* Does gender have an effect on student performance?

###Univariate Analysis (How is gender distributed?)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Bar plot on the first subplot
sns.countplot(x='gender', data=df, palette='Set2', ax=axes[0])
axes[0].set_title('Gender Count')

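# Pie chart on the second subplot; labels come from the value_counts index
# so each slice is named after the category it actually represents
gender_counts = df['gender'].value_counts()
gender_counts.plot.pie(
    ax=axes[1],
    autopct='%1.1f%%',
    colors=['#66b3ff', '#ff9999'],
    labels=gender_counts.index,
    explode=[0, 0.1],
    shadow=True
)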
axes[1].set_title('Gender Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

####Gender data is balanced: 518 female students (52%) and 482 male students (48%)
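
The percentages follow directly from the counts. A small worked example, using the counts of 518 and 482 stated in this notebook:

```python
import pandas as pd

# Counts as reported in the notebook
gender_counts = pd.Series({"female": 518, "male": 482})

# Share of each gender as a percentage of all students
gender_share = (gender_counts / gender_counts.sum() * 100).round(1)
print(gender_share)
```

On the full DataFrame, `df['gender'].value_counts(normalize=True) * 100` should give the same shares without typing the counts by hand.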


###BIVARIATE ANALYSIS
Does gender have an effect on student performance?

In [None]:
gender_group = df.groupby('gender').mean(numeric_only = True)
gender_group

In [None]:
plt.figure(figsize = (10, 8))

X = ['Total Average', 'Math Average']

# Look up group means by label rather than by position
female_scores = [gender_group.loc['female', 'average'], gender_group.loc['female', 'math_score']]
male_scores = [gender_group.loc['male', 'average'], gender_group.loc['male', 'math_score']]

X_axis = np.arange(len(X))

plt.bar(X_axis - 0.2, male_scores, 0.4, label = 'Male')
plt.bar(X_axis + 0.2, female_scores, 0.4, label = 'Female')

plt.xticks(X_axis, X)
plt.ylabel("Marks")
plt.title("Total average v/s Math average marks of both the genders", fontweight='bold')
plt.legend()
plt.show()



Insights
* On average, females have a better overall score than males.
* Males have scored higher in Math.
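
Group means like these can be read off a `groupby` result by label, which keeps the lookup independent of row order. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math_score": [60, 70, 70, 80],
    "average": [70.0, 80.0, 60.0, 70.0],
})

gender_means = toy.groupby("gender").mean(numeric_only=True)

# .loc[row_label, column_label] selects one group mean
female_average = gender_means.loc["female", "average"]
male_math = gender_means.loc["male", "math_score"]

print(gender_means)
```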


####4.4.2 RACE/ETHNICITY COLUMN
* How is the group-wise distribution?
* Does race/ethnicity have any impact on student performance?

####UNIVARIATE ANALYSIS (How is the group-wise distribution?)



In [None]:
fig, axes = plt.subplots(1, 2, figsize = (14, 6))

sns.countplot(x = 'race_ethnicity', data = df, ax = axes[0], palette = 'Set2')
axes[0].set_title('Race Count')
axes[0].bar_label(axes[0].containers[0], fontsize = 12)

df['race_ethnicity'].value_counts().plot.pie(
    ax = axes[1],
    autopct = "%1.1f%%",
    labels = ['Group C', 'Group D','Group B','Group E','Group A'],
    colors = ['red', 'green', 'blue', 'cyan', 'orange'],  # matplotlib's pie takes `colors`, not `color`
    explode = [0.1, 0.1, 0.1, 0.1, 0.1],
    shadow = True
)
axes[1].set_title('Race/Ethnicity Distribution')
axes[1].set_ylabel('')  # Hide y-axis label

plt.tight_layout()
plt.show()

* Most of the students belong to group C or group D.
* The lowest number of students belong to group A.

In [None]:
race_group = df.groupby('race_ethnicity').mean(numeric_only=True)
race_group

In [None]:
Group_data2=df.groupby('race_ethnicity')
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.barplot(x=Group_data2['math_score'].mean().index,y=Group_data2['math_score'].mean().values,palette = 'mako',ax=ax[0])
ax[0].set_title('Math score',color='#005ce6',size=20)

for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=15)

sns.barplot(x=Group_data2['reading_score'].mean().index,y=Group_data2['reading_score'].mean().values,palette = 'flare',ax=ax[1])
ax[1].set_title('Reading score',color='#005ce6',size=20)

for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=15)

sns.barplot(x=Group_data2['writing_score'].mean().index,y=Group_data2['writing_score'].mean().values,palette = 'coolwarm',ax=ax[2])
ax[2].set_title('Writing score',color='#005ce6',size=20)

for container in ax[2].containers:
    ax[2].bar_label(container,color='black',size=15)

* Group E students have scored the highest marks.
* Group A students have scored the lowest marks.
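* Students from a lower socioeconomic status appear to have a lower average in all subjects, although these plots group by race/ethnicity and do not measure socioeconomic status directly.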
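
The highest- and lowest-scoring groups can be picked out programmatically with `idxmax` and `idxmin` on the grouped means. The toy rows below are hypothetical and only mirror the shape of the real data.

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "race_ethnicity": ["group A", "group A", "group C", "group C", "group E", "group E"],
    "math_score": [50, 60, 65, 75, 80, 90],
})

# Mean math score per group, indexed by group label
race_means = toy.groupby("race_ethnicity")["math_score"].mean()

best_group = race_means.idxmax()   # label of the group with the highest mean
worst_group = race_means.idxmin()  # label of the group with the lowest mean

print(race_means)
print(best_group, worst_group)
```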

###4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN
* What is the educational background of the students' parents?
* Does parental education have any impact on student performance?

###UNIVARIATE ANALYSIS (What is the educational background of the students' parents?)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Univariate
sns.countplot(y='parental_level_of_education', data=df, ax=axes[0], palette='coolwarm')
axes[0].set_title('Parental Education Count')
axes[0].bar_label(axes[0].containers[0], fontsize=10)

# Pie chart of the same distribution
df['parental_level_of_education'].value_counts().plot.pie(
    ax=axes[1],
    autopct='%1.1f%%',
    labels=df['parental_level_of_education'].value_counts().index,
    colors=sns.color_palette('pastel'),
    explode=[0.05]*len(df['parental_level_of_education'].unique()),
    shadow=True
)
axes[1].set_title('Parental Education Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()


The largest group of parents has "some college" education.

In [None]:
parent_edu_group = df.groupby('parental_level_of_education').mean(numeric_only=True)
parent_edu_group



###BIVARIATE ANALYSIS (Does parental education have any impact on student performance?)

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='parental_level_of_education',saturation=0.95,ax=ax[0])
ax[0].set_title('Students by parental level of education',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='lunch',saturation=0.95,ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)

The scores of students whose parents hold a master's or bachelor's degree are higher than those of other students.

###4.4.4 LUNCH COLUMN
* Which type of lunch is most common among students?
* What is the effect of lunch type on test results?

###UNIVARIATE ANALYSIS (Which type of lunch is most common among students?)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Univariate
sns.countplot(x='lunch', data=df, ax=axes[0], palette='Accent')
axes[0].set_title('Lunch Type Count')
axes[0].bar_label(axes[0].containers[0], fontsize=12)

# Pie chart of the same distribution
df['lunch'].value_counts().plot.pie(
    ax=axes[1],
    autopct='%1.1f%%',
    labels=df['lunch'].value_counts().index,
    colors=['#c2c2f0', '#ffb3e6'],
    explode=[0, 0.1],
    shadow=True
)
axes[1].set_title('Lunch Type Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()


More students are served standard lunch than free/reduced lunch.


###BIVARIATE ANALYSIS (Does lunch type have any impact on student performance?)

In [None]:
lunch_group = df.groupby('lunch').mean(numeric_only=True)
lunch_group

In [None]:

f,ax=plt.subplots(1,2,figsize=(20,8))
sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='test_preparation_course',saturation=0.95,ax=ax[0])
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='lunch',saturation=0.95,ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)

Students who get standard lunch tend to perform better than students who get free/reduced lunch.

###4.4.5 TEST PREPARATION COURSE COLUMN
* How many students completed the test preparation course?
* Does the test preparation course have any impact on student performance?


###UNIVARIATE ANALYSIS (How many students completed the test preparation course?)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Univariate
sns.countplot(x='test_preparation_course', data=df, ax=axes[0], palette='spring')
axes[0].set_title('Test Prep Course Count')
axes[0].bar_label(axes[0].containers[0], fontsize=12)

# Pie chart of the same distribution
df['test_preparation_course'].value_counts().plot.pie(
    ax=axes[1],
    autopct='%1.1f%%',
    labels=df['test_preparation_course'].value_counts().index,
    colors=['#aaffc3', '#ffd8b1'],
    explode=[0, 0.1],
    shadow=True
)
axes[1].set_title('Test Prep Course Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()



###Bivariate Analysis
(Does the test preparation course have any impact on student performance?)

In [None]:
test_prep_group = df.groupby('test_preparation_course').mean(numeric_only=True)
test_prep_group


Students who have completed the test preparation course have higher scores in all three categories than those who haven't taken the course.
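
The size of the gap can be made explicit by subtracting the two rows of group means. This is a sketch on hypothetical toy rows; the name `gain` is an assumption, and the labels `completed` and `none` are assumed to match the values in the `test_preparation_course` column.

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "test_preparation_course": ["completed", "completed", "none", "none"],
    "math_score": [70, 80, 60, 70],
    "reading_score": [78, 82, 66, 70],
})

prep_means = toy.groupby("test_preparation_course").mean(numeric_only=True)

# Difference in mean score per subject: completed minus none
gain = prep_means.loc["completed"] - prep_means.loc["none"]
print(gain)
```

A positive value in every subject would correspond to the pattern described here.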

####4.4.6 Checking for Outliers

In [None]:
plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math_score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading_score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing_score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average'],color='lightgreen')
plt.show()
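
Box plots conventionally mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as sketched below; the helper name `iqr_outliers` and the toy scores are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy scores for illustration only; not taken from stud.csv
toy_scores = pd.Series([0, 55, 60, 62, 65, 66, 70, 72, 75, 80], name="math_score")

outliers = iqr_outliers(toy_scores)
print(outliers.tolist())
```

On the real data this could be called as `iqr_outliers(df['math_score'])` for each score column.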

###4.4.7 Multivariate analysis using Pairplot

In [None]:
sns.pairplot(df,hue = 'gender')
plt.show()

All scores increase roughly linearly with one another.
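
A correlation matrix is one way to put a number on this linear relationship. The sketch below uses hypothetical toy rows and the default Pearson correlation of `DataFrame.corr`.

```python
import pandas as pd

# Toy rows for illustration only; not taken from stud.csv
toy = pd.DataFrame({
    "math_score": [40, 50, 60, 70],
    "reading_score": [45, 55, 65, 75],
    "writing_score": [44, 52, 66, 74],
})

# Pairwise Pearson correlation between the score columns
score_corr = toy.corr()
print(score_corr.round(3))
```

Values close to 1 indicate that two scores rise together almost linearly.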

###5. Conclusions
* Student performance is related to lunch, race/ethnicity, and parental level of education
* Females lead in pass percentage and are also the top scorers
* Student performance is also related to the test preparation course: students who completed it score higher in all three subjects
* Finishing the preparation course is beneficial.