# 1) Problem statement
This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch type, and test preparation course.


# 2) Data Collection
Dataset source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

The data consists of 8 columns and 1000 rows.

### Import Necessary Libraries 

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Import data as dataframe

In [9]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"D:\Study\Data Science\Python\ineuron\Data_Set\Student_dataset\stud.csv")
data.head(5)

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


#### Show the shape of the dataset

In [20]:
data.shape

(1000, 8)
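A hard-coded absolute path only works on one machine. The following is a minimal sketch of a more portable load followed by a shape check. It assumes a file named `stud.csv`; the temporary directory and the two-row CSV text (copied from the `data.head()` output) are stand-ins so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the data.head() output; the real file has 1000 rows.
CSV_TEXT = (
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,none,72,72,74\n"
    "female,group C,some college,standard,completed,69,90,88\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; substitute the real folder that holds stud.csv.
    csv_path = Path(tmp_dir) / "stud.csv"
    csv_path.write_text(CSV_TEXT, encoding="utf-8")
    sample = pd.read_csv(csv_path)

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns")
```

Building the path with `pathlib.Path` avoids backslash escaping issues and works the same way on Windows, macOS, and Linux.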

# 3) Dataset information

The dataset contains the following features:
1. gender -> Gender of the student
2. race/ethnicity -> Group (A to E) to which the student belongs
3. parental level of education -> Highest education level of the student's parent
4. lunch -> Type of lunch the student receives: standard or free/reduced
5. test preparation course -> Whether the student completed the test preparation course
6. math score -> Math score of the student
7. reading score -> Reading score of the student
8. writing score -> Writing score of the student
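The feature list can double as a simple schema check. The sketch below is one possible way to do that, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the small frame uses the first two rows shown by `data.head()`.

```python
import pandas as pd

# Expected column names, taken from the feature list and the data.head() output.
EXPECTED_COLUMNS = [
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Stand-in frame built from the first two rows of the dataset.
sample = pd.DataFrame({
    "gender": ["female", "female"],
    "race/ethnicity": ["group B", "group C"],
    "parental level of education": ["bachelor's degree", "some college"],
    "lunch": ["standard", "standard"],
    "test preparation course": ["none", "completed"],
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

print(missing_columns(sample))                          # nothing missing
print(missing_columns(sample.drop(columns=["lunch"])))  # reports the dropped column
```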




# 4) Data checks to perform
1. Check missing values
2. Check duplicates
3. Check data types
4. Check the number of unique values in each column
5. Check statistics of the dataset
6. Check the categories present in each categorical column

#### 4.1 Check whether any feature has null values


In [23]:
data.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

#### Remarks: There are no missing values in any of the features
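Because this dataset has no missing values, the check returns all zeros. To show what a non-zero result looks like, here is a small sketch on a toy frame in which one score is deliberately set to `NaN`. The frame and the `summary` layout are illustrative assumptions, not output from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame; the NaN is introduced on purpose for illustration only.
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 69, np.nan],
    "reading score": [72, 90, 57],
})

missing_count = sample.isnull().sum()       # missing cells per column
missing_pct = sample.isnull().mean() * 100  # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Reporting the percentage alongside the count makes it easier to judge whether a column can be imputed or should be dropped.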

#### 4.2 Check for duplicates

In [33]:
data.duplicated().sum()

0

#### Remarks: There are no duplicate rows in the dataset
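For reference, `duplicated()` marks a row only when every column matches an earlier row. The sketch below uses a toy frame whose third row intentionally repeats the first, since the real data contains no duplicates to demonstrate with.

```python
import pandas as pd

# Toy frame; the third row repeats the first on purpose.
sample = pd.DataFrame({
    "gender": ["female", "female", "female"],
    "math score": [72, 69, 72],
    "reading score": [72, 90, 72],
})

# By default only the later copies are flagged, so this counts 1.
n_dupes = int(sample.duplicated().sum())

# keep=False flags every member of a duplicate group, useful for inspection.
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row.
deduped = sample.drop_duplicates()

print(n_dupes, len(all_copies), len(deduped))
```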

#### 4.3 Check the data type of each feature

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


#### 4.4 Check the number of unique values of each column

In [38]:
data.nunique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

In [39]:
# Different values in Race/Ethnicity
data['race/ethnicity'].unique()

array(['group B', 'group C', 'group A', 'group D', 'group E'],
      dtype=object)

In [15]:
# Different values in parental level of education
data['parental level of education'].unique()

array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school'],
      dtype=object)

In [16]:
# Different values in lunch
data['lunch'].unique()

array(['standard', 'free/reduced'], dtype=object)

In [17]:
# Different values in test preparation course
data['test preparation course'].unique()

array(['none', 'completed'], dtype=object)
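Calling `unique()` column by column can be condensed into a loop, and `value_counts()` additionally shows how often each category occurs. This is a sketch under the assumption that every non-numeric column is categorical; the frame holds a subset of columns from the first five rows of the dataset.

```python
import pandas as pd

# Subset of columns from the first five rows shown by data.head(5).
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none", "none", "none"],
    "math score": [72, 69, 90, 47, 76],
})

# Treat every non-numeric column as categorical (an assumption of this sketch).
categorical_cols = sample.select_dtypes(exclude="number").columns

# Map each categorical column to its category frequencies.
category_counts = {col: sample[col].value_counts() for col in categorical_cols}

for col, counts in category_counts.items():
    print(f"{col}: {counts.to_dict()}")
```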

#### 4.5 Check statistics of the dataset

In [10]:
data.describe()

|   | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


#### Remarks:
1. The mean of each score lies between 66 and 69.2
2. The standard deviation of each score lies between 14.6 and 15.2
3. Math scores range from 0 to 100
4. Reading scores range from 17 to 100
5. Writing scores range from 10 to 100
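The figures quoted in the remarks can also be pulled out directly instead of being read off the `describe()` table. A minimal sketch, using only the score columns of the first five rows of the dataset, so the numbers differ from the full-data statistics:

```python
import pandas as pd

# Score columns of the first five rows shown by data.head(5).
scores = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# One row per statistic, one column per subject.
summary = scores.agg(["min", "max", "mean"])
print(summary)

# Index label of the student with the lowest math score.
lowest_math_row = scores["math score"].idxmin()
print(lowest_math_row)
```

`idxmin()` is handy for following up on an extreme value such as the math minimum of 0, because it returns the row label that can then be inspected with `loc`.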

#### 4.6 Segregate the data by data type

In [93]:
data[data.dtypes[data.dtypes=='object'].index].head(5)

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course |
|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none |
| 1 | female | group C | some college | standard | completed |
| 2 | female | group B | master's degree | standard | none |
| 3 | male | group A | associate's degree | free/reduced | none |
| 4 | male | group C | some college | standard | none |


In [94]:
data[data.dtypes[data.dtypes!='object'].index].head(5)

|   | math score | reading score | writing score |
|---|---|---|---|
| 0 | 72 | 72 | 74 |
| 1 | 69 | 90 | 88 |
| 2 | 90 | 95 | 93 |
| 3 | 47 | 57 | 44 |
| 4 | 76 | 78 | 75 |


In [80]:
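# Collect the names of numerical columns (any dtype other than object)
data_numerical = []
for feature in data.columns:
    if data[feature].dtype != 'O':
        data_numerical.append(feature)

# Collect the names of categorical columns (object dtype)
data_categorical = []
for feature in data.columns:
    if data[feature].dtype == 'O':
        data_categorical.append(feature)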

In [79]:
data_numerical

['math score', 'reading score', 'writing score']

In [81]:
data_categorical

['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course']

In [82]:
# define numerical & categorical columns
numeric_features = [feature for feature in data.columns if data[feature].dtype != 'O']
categorical_features = [feature for feature in data.columns if data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('We have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 3 numerical features : ['math score', 'reading score', 'writing score']
We have 5 categorical features : ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
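pandas also provides `DataFrame.select_dtypes`, which performs the same split without comparing dtypes by hand. The sketch below is an alternative to the loops and comprehensions used in this section, shown on a small stand-in frame built from the first two rows of the dataset.

```python
import pandas as pd

# Stand-in frame with two categorical and two numerical columns.
sample = pd.DataFrame({
    "gender": ["female", "female"],
    "lunch": ["standard", "standard"],
    "math score": [72, 69],
    "reading score": [72, 90],
})

# "number" matches all integer and float dtypes.
numeric_features = sample.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(numeric_features)} numerical features : {numeric_features}")
print(f"We have {len(categorical_features)} categorical features : {categorical_features}")
```

Selecting on `"number"` rather than on `'O'` keeps the split correct even if string columns are stored with a dedicated string dtype instead of `object`.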


#### 4.7 Create two new features: Total (sum of the three scores) and Average

In [96]:
data['Total'] = data['math score'] + data['reading score'] + data['writing score']
data['Average'] = data['Total']/3
data.head(2)

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.333333 |
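The same two features can be derived with row-wise aggregation, which avoids hard-coding the divisor and scales if more score columns are added. A minimal sketch on the first two rows of the dataset; `score_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Score columns of the first two rows of the dataset.
sample = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# Explicit list, so the derived columns are never included in the aggregation.
score_cols = ["math score", "reading score", "writing score"]

# axis=1 aggregates across columns, producing one value per student.
sample["Total"] = sample[score_cols].sum(axis=1)
sample["Average"] = sample[score_cols].mean(axis=1)

print(sample)
```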


#### 4.8 Print top scores

In [128]:
# Students with a score of 100 in at least one subject
perfect_any = (data['math score'] == 100) | (data['reading score'] == 100) | (data['writing score'] == 100)
print(f"Number of students who scored 100 in at least one subject: {perfect_any.sum()}")
data[perfect_any]

Number of students who scored 100 in at least one subject: 23


|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 106 | female | group D | master's degree | standard | none | 87 | 100 | 100 | 287 | 95.666667 |
| 114 | female | group E | bachelor's degree | standard | completed | 99 | 100 | 100 | 299 | 99.666667 |
| 149 | male | group E | associate's degree | free/reduced | completed | 100 | 100 | 93 | 293 | 97.666667 |
| 165 | female | group C | bachelor's degree | standard | completed | 96 | 100 | 100 | 296 | 98.666667 |
| 179 | female | group D | some high school | standard | completed | 97 | 100 | 100 | 297 | 99.0 |
| 377 | female | group D | master's degree | free/reduced | completed | 85 | 95 | 100 | 280 | 93.333333 |
| 381 | male | group C | associate's degree | standard | completed | 87 | 100 | 95 | 282 | 94.0 |
| 403 | female | group D | high school | standard | completed | 88 | 99 | 100 | 287 | 95.666667 |
| 451 | female | group E | some college | standard | none | 100 | 92 | 97 | 289 | 96.333333 |
| 458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 | 300 | 100.0 |


In [129]:
# Students with a score of 100 in both math and reading
perfect_math_reading = (data['math score'] == 100) & (data['reading score'] == 100)
print(f"Number of students who scored 100 in both math and reading: {perfect_math_reading.sum()}")
data[perfect_math_reading]

Number of students who scored 100 in both math and reading: 4


|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 149 | male | group E | associate's degree | free/reduced | completed | 100 | 100 | 93 | 293 | 97.666667 |
| 458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 | 300 | 100.0 |
| 916 | male | group E | bachelor's degree | standard | completed | 100 | 100 | 100 | 300 | 100.0 |
| 962 | female | group E | associate's degree | standard | none | 100 | 100 | 100 | 300 | 100.0 |


In [130]:
# Students with a score of 100 in all three subjects
perfect_all = (data['math score'] == 100) & (data['reading score'] == 100) & (data['writing score'] == 100)
print(f"Number of students who scored 100 in all 3 subjects: {perfect_all.sum()}")
data[perfect_all]

Number of students who scored 100 in all 3 subjects: 3


|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | Total | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 458 | female | group E | bachelor's degree | standard | none | 100 | 100 | 100 | 300 | 100.0 |
| 916 | male | group E | bachelor's degree | standard | completed | 100 | 100 | 100 | 300 | 100.0 |
| 962 | female | group E | associate's degree | standard | none | 100 | 100 | 100 | 300 | 100.0 |
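The three filters in this section repeat the same comparison with different combinations of `|` and `&`. One possible generalization, sketched below, is to count the number of perfect scores per student once and then filter on that count. The frame holds the score columns of five rows that appear in the outputs of this notebook (index labels 0, 106, 149, 451, 458), and the variable names are illustrative.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Score columns of five rows taken from the outputs shown in this notebook.
sample = pd.DataFrame(
    [[72, 72, 74], [87, 100, 100], [100, 100, 93], [100, 92, 97], [100, 100, 100]],
    index=[0, 106, 149, 451, 458],
    columns=score_cols,
)

# eq(100) gives a boolean frame; summing across columns counts perfect scores per student.
perfect_per_student = sample[score_cols].eq(100).sum(axis=1)

at_least_one = int((perfect_per_student >= 1).sum())
all_three = int((perfect_per_student == 3).sum())

print(perfect_per_student)
print(f"At least one perfect score: {at_least_one}, all three perfect: {all_three}")
```

With the count in hand, any threshold (exactly two perfect scores, at least two, and so on) becomes a single comparison.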
