***Reading data into R***
* Let's read the data into R and load the required packages.

In [None]:
library(ggplot2)
library(dplyr)
library(readr)
library(tidyverse)
library(car)
# Read from the raw file URL; the GitHub "blob" page returns HTML, not CSV
Student_data <- read.csv("https://raw.githubusercontent.com/dsrscientist/dataset4/main/Grades.csv", header=TRUE, sep=",")
summary(Student_data)  # summary of variables

In [None]:
summary(Student_data$G3) 

Histogram of final grade

In [None]:
hist(Student_data$G3, 
     main="Histogram for Final Grade G3", 
     xlab="Students' final grade", 
     border="blue", 
     col="green",
     las=1, 
     breaks=15)

The histogram above shows an unusually large number of zeros in the final grade, which could be due to students being absent or disqualified. Otherwise, the distribution looks close to normal.
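
One way to quantify the spike at zero is to count the zero grades directly. The sketch below uses a small made-up vector, `g3_example` (hypothetical values, not the real data); with the notebook's data the same calls would be applied to `Student_data$G3`.

```r
# Hypothetical grades for illustration only (not taken from Student_data)
g3_example <- c(0, 0, 0, 8, 10, 10, 11, 12, 13, 15)

# Number of students with a final grade of zero
n_zero <- sum(g3_example == 0)

# Share of zero grades in the sample
prop_zero <- mean(g3_example == 0)

n_zero     # 3
prop_zero  # 0.3

# Summary of the grades once the zeros are set aside
summary(g3_example[g3_example > 0])
```

Comparing the summary with and without the zeros gives a quick sense of how much they pull the distribution down.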

Now let's look at the frequency table of the final grade (G3).

In [None]:
table(Student_data$G3)

**Variable Gender**

In [None]:
 ggplot(Student_data, aes(x = sex, y = G3)) + 
  geom_boxplot(aes(fill = sex),alpha = .6,size = 1) + 
  scale_fill_brewer(palette = "Spectral") + 
  stat_summary(fun = "mean", geom = "point", shape= 23, size= 3, fill= "white") +
  ggtitle("Grade distribution by gender") + 
  theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank())
 

The plot above suggests a difference between the grades of male and female students, but we can't say this with certainty from the plot alone. To check, we move on to hypothesis testing and use the t.test function to compare the grades of the two genders.

* **H0: There is no difference in the final grades of male and female students.**
* **H1: There is a significant difference between the final grades of male and female students.**

In [None]:
#testing hypothesis
 ht1<- t.test(G3~sex,data=Student_data,subset=sex%in%c('M','F'))
 ht1

Looking at the results of the t-test, there is evidence that male students' final grades are significantly higher than female students' at the 0.05 alpha level.
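
To make that reading of the output concrete, here is a minimal sketch on simulated data. The data frame `sim` and its values are hypothetical stand-ins for `Student_data`; only the structure of the `t.test()` result carries over.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data frame standing in for Student_data (values are simulated)
sim <- data.frame(
  sex = rep(c("F", "M"), each = 50),
  G3  = c(rnorm(50, mean = 10, sd = 4), rnorm(50, mean = 11, sd = 4))
)

# With a formula, t.test() runs Welch's two-sample test by default (var.equal = FALSE)
tt <- t.test(G3 ~ sex, data = sim)

tt$estimate   # group means: "mean in group F" and "mean in group M"
tt$conf.int   # 95% confidence interval for the difference (F minus M)
tt$p.value    # two-sided p-value

# Decision at the 0.05 level
reject_h0 <- tt$p.value < 0.05
reject_h0
```

The p-value only says whether the means differ; which group scores higher is read from the two group means (or the sign of the confidence interval).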
 

**Variable Age**

Plotting a bar chart of age

In [None]:
ggplot(Student_data,aes(x=factor(age)))+geom_bar(stat="count",width=0.7,fill="steelblue")+theme_minimal()
 

**Age vs final grade**

In [None]:
 Student_data$age=factor(Student_data$age)
  ggplot(Student_data, aes(x=age, y=G3, fill=age)) + geom_boxplot()+labs(title="Plot of final Grades by age",x="age", y = "final Grade")+
   theme_classic()


The plot above shows that the median grades of three age groups (15, 16, 17) are similar. Note the skewness of age group 19, which may be due to its small sample size.
Age group 20 seems to score the highest grades of all.
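
Since small groups can make a boxplot misleading, it may help to check how many students fall in each age group next to the medians. The sketch below uses a made-up data frame `ex` (hypothetical values); with the real data the same calls would use `Student_data$age` and `Student_data$G3`.

```r
# Hypothetical ages and grades for illustration (not the real data)
ex <- data.frame(
  age = c(15, 15, 15, 16, 16, 16, 17, 17, 19, 20),
  G3  = c(10, 12, 11,  9, 13, 11, 10, 12,  8, 18)
)

# Number of students per age group; very small groups give unreliable boxplots
n_by_age <- table(ex$age)
n_by_age

# Median grade per age group
med_by_age <- tapply(ex$G3, ex$age, median)
med_by_age
```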


**ANOVA** to test whether final grades differ among the age groups

In [None]:
aov1 <- aov(G3 ~ age, data = Student_data)
 TukeyHSD(aov1)

The output shows no significant difference in final grades among the age groups. Note that the adjusted p-values are all higher than 0.05.
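
As a sketch of how the pieces of this output fit together, the example below runs the omnibus F test and the Tukey comparisons on simulated data. The data frame `sim` is a hypothetical stand-in with three groups drawn from the same distribution, so no particular result is implied for the real data.

```r
set.seed(2)

# Simulated stand-in for Student_data: three groups from the same distribution
sim <- data.frame(
  age = factor(rep(c("15", "16", "17"), each = 30)),
  G3  = rnorm(90, mean = 10, sd = 4)
)

fit <- aov(G3 ~ age, data = sim)

# Omnibus F test: is there any difference among the group means?
summary(fit)

# Pairwise comparisons; the result is a list with one matrix per factor
tk <- TukeyHSD(fit)
tk$age               # columns: diff, lwr, upr, p adj
tk$age[, "p adj"]    # adjusted p-values only
```

`summary(fit)` answers whether any group differs at all, while the `p adj` column answers which pairs differ after correcting for multiple comparisons.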

Checking the ANOVA assumption of homogeneity of variance


In [None]:
library(car)
leveneTest(G3 ~ age, data = Student_data)

The output indicates that we can assume equality of variances: the p-value (0.74) is higher than 0.05.
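
As a rough illustration of what the test measures: with its default `center = median`, `leveneTest()` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data; `sim`, `grp` and `y` are hypothetical names and values, not part of the notebook's data.

```r
set.seed(3)

# Simulated data (hypothetical): three groups with the same spread
sim <- data.frame(
  grp = factor(rep(c("a", "b", "c"), each = 40)),
  y   = rnorm(120, mean = 10, sd = 3)
)

# Absolute deviation of each value from its own group median
grp_median  <- tapply(sim$y, sim$grp, median)
sim$abs_dev <- abs(sim$y - grp_median[as.character(sim$grp)])

# One-way ANOVA on the deviations; a small p-value would point to unequal spread
lev <- anova(lm(abs_dev ~ grp, data = sim))
lev
lev[["Pr(>F)"]][1]   # p-value of the test
```

A large p-value here means the data give no evidence against equal variances, which is weaker than proving the variances are equal.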

**Variable Father's education (Fedu)**


In [None]:
Student_data$Fedu=factor(Student_data$Fedu)
 ggplot(Student_data, aes(x = Fedu, y = G3)) + 
   geom_boxplot(aes(fill = Fedu),alpha = .6,size = 1) + 
   scale_fill_brewer(palette = "Set2") + 
   stat_summary(fun = "mean", geom = "point", shape= 23, size= 3, fill= "white") +
   ggtitle("Grade distribution by father's edu") + 
   theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank())

The graph above shows that grades differ depending on the father's education, but we don't know:
1. Whether the difference is significant.
1. Exactly which groups differ in their grades.

To address this, we move on to ANOVA to locate any significant differences.


In [None]:
res.aov <- aov(G3 ~ Fedu, data = Student_data)
 # Summary of the analysis
 summary(res.aov)
 TukeyHSD(res.aov)

The output shows that only groups 4 and 1 differ in their grades (note the adjusted p-value below 0.05).
Students whose fathers are highly educated perform better than those whose fathers have only a primary education.
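
When there are many pairs, picking out the significant ones by eye gets error-prone. A minimal sketch of filtering the Tukey table by adjusted p-value follows; the data frame `sim` is simulated (hypothetical values, with group "4" deliberately given a higher mean), so the pairs it reports say nothing about the real data.

```r
set.seed(4)

# Simulated stand-in (hypothetical values): group "4" is given a higher mean
sim <- data.frame(
  Fedu = factor(rep(c("1", "2", "3", "4"), each = 40)),
  G3   = c(rnorm(120, mean = 10, sd = 3), rnorm(40, mean = 13, sd = 3))
)

tk <- TukeyHSD(aov(G3 ~ Fedu, data = sim))

# Keep only the pairs whose adjusted p-value is below 0.05
# (drop = FALSE keeps the matrix shape even if a single row remains)
sig <- tk$Fedu[tk$Fedu[, "p adj"] < 0.05, , drop = FALSE]
sig
```

The row names of `sig` identify the pairs (for example "4-1"), and the sign of `diff` shows which group in each pair has the higher mean.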

**Checking the assumption of equal variance**


In [None]:
leveneTest(G3 ~ Fedu, data = Student_data)

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

**Mother's Education and Student's performance**


In [None]:
Student_data$Medu=factor(Student_data$Medu)
 p4 <- ggplot(Student_data, aes(x = Medu, y = G3)) + 
   geom_boxplot(aes(fill = Medu),alpha = .6,size = 1) + 
   scale_fill_brewer(palette = "Spectral") + 
   stat_summary(fun = "mean", geom = "point", shape= 23, size= 3, fill= "white") +
   ggtitle("Grade distribution by Mother's education") + 
   theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank())
 p4

The box plot above shows that only the distribution of group 0 (i.e., primary educated) is skewed, which may be due to the small group size. The other groups seem to be nearly normal.


Applying ANOVA to find and locate exactly which groups differ significantly.

In [None]:
res.aov2 <- aov(G3 ~ Medu, data = Student_data)
 TukeyHSD(res.aov2)

The output above shows that only the pairs 4-1 and 4-2 are significantly different (note the adjusted p-values below 0.05).
Children of highly educated mothers score significantly higher than those of elementary and middle educated mothers.

**Checking the ANOVA assumption of equal variance:**

In [None]:
leveneTest(G3 ~ Medu,data=Student_data)

The p-value is greater than 0.05, which indicates homogeneity of variance.


**Effect of mother's profession on final grades**

In [None]:
p5 <- ggplot(Student_data, aes(x = Mjob, y = G3)) + 
   geom_boxplot(aes(fill = Mjob),alpha = .6,size = 1) + 
   scale_fill_brewer(palette = "Set2") + 
   stat_summary(fun = "mean", geom = "point", shape= 23, size= 3, fill= "white") +
   ggtitle("Grade distribution by Mother's profession") + 
   theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank())
p5

Students whose mothers are health professionals have a higher median grade than the other groups. Students whose mothers stay at home have the lowest median grade.

**Let's see which groups differ in grades with respect to mother's profession**


In [None]:
res.aov3 <- aov(G3 ~ Mjob, data = Student_data)
 TukeyHSD(res.aov3)

The health care professional and stay-at-home groups differ significantly in final grades.

Levene's test of equal variances (checking the assumption of ANOVA)


In [None]:
 leveneTest(G3~Mjob,data=Student_data)

The test gives no evidence against equal variances, since the p-value is greater than 0.05.

**Father's profession effect on final grades**

In [None]:
p6 <-ggplot(Student_data, aes(x = Fjob, y = G3)) + 
  geom_boxplot(aes(fill = Fjob),alpha = .6,size = 1) + 
  scale_fill_brewer(palette = "Set1") + 
  stat_summary(fun = "mean", geom = "point", shape= 23, size= 3, fill= "white") +
  ggtitle("Grade distribution by father's profession") + 
  theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank())
p6

The graph above shows that there is not much difference in students' grades with respect to father's profession (notice the medians of the first four groups). Only students whose fathers are teachers seem to have a higher median grade, but their grades are also much more dispersed than those of the other professions.

**ANOVA to locate the differences**

In [None]:
res.aov4 <- aov(G3 ~ Fjob, data = Student_data)
 TukeyHSD(res.aov4)

The results above show that none of the groups differ significantly in final grades. **Father's profession does not seem to affect the grades of the students.**

In [None]:
 leveneTest(G3~Fjob,data=Student_data)

Notice that the p-value is higher than 0.05. Therefore, we can assume that the variances are equal.
