## Mini Project 1: 

### Description

This project aims to observe the relationships between various students' socio-economic circumstances and overall satisfaction regarding the college experience. Exploring the relationships between said circumstances and college satisfaction allows us to conclude any impact or significance regarding these variables.

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np
import plotly.express as plx
import plotly.graph_objects as go
import seaborn as sns

df = pd.read_csv('data.csv')

# potential questions data frame, cleaning up column names
pqdf = df.iloc[:, [2, 3, 7, 14, 17, 23, 31, 32, 37, 38, 50, 51, 52, 80]]
pqdf = pqdf.rename(columns={
    pqdf.columns[0]: 'Age',
    pqdf.columns[1]: 'Gender',
    pqdf.columns[2]: 'GenImmigrant',
    pqdf.columns[3]: 'CollegeYear',
    pqdf.columns[4]: 'GPA',
    pqdf.columns[5]: 'CurrentFeelCol',
    pqdf.columns[6]: 'GenStudent',
    pqdf.columns[7]: 'ParentEdu',
    pqdf.columns[8]: 'LecRequired',
    pqdf.columns[9]: 'LecAttended',
    pqdf.columns[10]: 'FinAid',
    pqdf.columns[11]: 'StudentLoans',
    pqdf.columns[12]: 'AnnualIncome',
    pqdf.columns[13]: 'CurrentFeelQtr',
})
pqdf.head()

Unnamed: 0,Age,Gender,GenImmigrant,CollegeYear,GPA,CurrentFeelCol,GenStudent,ParentEdu,LecRequired,LecAttended,FinAid,StudentLoans,AnnualIncome,CurrentFeelQtr
0,22,Male,1st generation,Senior,3.0 - 3.49,3.0,1st generation (first to attend),Masters,8.0,4,Yes,No,50k - 150k,1.0
1,22,Female,1st generation,Junior,3.0 - 3.49,4.0,2nd generation (parents attended college),PhD,9.0,9,Yes,No,50k - 150k,2.0
2,27,Male,3rd generation or more,Senior,3.5 - 4.0,3.0,2nd generation (parents attended college),Bachelors,8.0,7,Yes,No,Less than 50k,4.0
3,24,Male,1st generation,Senior,3.5 - 4.0,3.0,1st generation (first to attend),High school,11.0,8,Yes,Yes,Less than 50k,3.0
4,40,Female,1st generation,Junior,0 - 1.49,3.0,1st generation (first to attend),High school,6.0,3,Yes,Yes,Less than 50k,1.0


### Questions 1 and 2
<!-- To what extent do social-economic circumstances affect one's academic performance and personal satisfaction with their college experience?
1. Gather which questions have to do with social-economic circumstances
11. Gather questions relating to college experience/happiness
111. Map questions from part 1 to part 2 -->

#### Data

The data that we are interested in aims to capture the general socio-economic background of the populous. The data collected about each participant in the survey can be categorized into the following groups:

* Socio-Economic Background
    * Age
    * Gender
    * Financial Aid amount
    * Student Loans
    * Immigrant generation
    * College generation
    * Highest level of education held by parents
    * Household annual income
* College Satisfaction/Performance
    * Cumulative GPA
    * College experience satisfaction levels
    * Feelings about current quarter
    * Stress level
    * Lectures per week
    * Lectures attended

#### Questions


Using this data set, we begin analysis on the following set of questions:

* Can one's parents' educational experience achieved predict one's GPA/College Satisfaction?
* Does one's annual household income have an effect on GPA/College Satisfaction?
* Does one's gender have any predictive power over one's GPA or satisfaction levels with the college experience?
* Does one's age have an effect on their college Satisfaction?
* Does one's age have an effect on their GPA?
* Does our data sample have a good male and female representation?

### Cleaning Dataset

In [2]:
# Handling NA, replacing with mode
pqdf.GPA.fillna(pqdf.GPA.value_counts().idxmax(),inplace=True)
pqdf.CurrentFeelCol.fillna(pqdf.CurrentFeelCol.value_counts().idxmax(),inplace=True)
pqdf.CurrentFeelCol = pqdf.CurrentFeelCol.astype(int)

# Mapping GPA to group numbers
pqdf.GPA = pd.Categorical(pqdf.GPA, ordered=True, categories=['0 - 1.49', '1.5 - 1.99', '2.0 - 2.49', '2.5 - 2.99', '3.0 - 3.49', '3.5 - 4.0'])
pqdf['GPA_Category'] = pqdf.GPA.cat.codes

In [3]:
# Cleaning ParentEdu values
pqdf.ParentEdu = pqdf["ParentEdu"].str.lower().str.capitalize()
pqdf.ParentEdu.fillna(pqdf.ParentEdu.value_counts().idxmax(),inplace=True)
pqdf.ParentEdu = pd.Categorical(pqdf.ParentEdu)
pqdf['ParentEdu_Category'] = pqdf.ParentEdu.cat.codes
#print(pqdf.ParentEdu, pqdf.ParentEdu_Category.min(), pqdf.ParentEdu_Category.max())

# Cleaning GenImmigrant values
pqdf.GenImmigrant = pqdf["GenImmigrant"].str.lower().str.capitalize()
pqdf.GenImmigrant.fillna(pqdf.GenImmigrant.value_counts().idxmax(),inplace=True)
pqdf.GenImmigrant = pd.Categorical(pqdf.GenImmigrant)
pqdf['GenImmigrant_Category'] = pqdf.GenImmigrant.cat.codes

# Cleaning Age values
pqdf.Age.fillna(pqdf.Age.value_counts().idxmax(),inplace=True)
pqdf['Age'] = pqdf.Age.replace('18-21', pqdf.Age.value_counts().idxmax())

# Cleaning CurrentFeelCol values
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCol
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCollege.replace(5,'Satisfied')
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCollege.replace(4,'Somewhat Satisfied')
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCollege.replace(3,'Neutral')
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCollege.replace(2,'Slightly Dissatisfied')
pqdf['CurrentFeelCollege'] = pqdf.CurrentFeelCollege.replace(1,'Dissatisfied')

In [4]:
#Cleaning for boxplot for finaid vs classes attended, decided to make a dataframe rather than a column as that made it easier
lecvsaid = pd.DataFrame()
lecvsaid["LecAttended"] = pqdf["LecAttended"].replace("all", np.nan).fillna("-.00001").astype(float)
lecvsaid[lecvsaid["LecAttended"]<0] = np.mean(round(lecvsaid["LecAttended"], 1))
lecvsaid["LecRequired"]= pqdf["LecRequired"].replace("0", np.nan).fillna("-.00001").astype(float)
lecvsaid[lecvsaid["LecRequired"]<0] = np.mean(round(lecvsaid["LecRequired"], 1))
lecvsaid["FinAid"] = pqdf["FinAid"].fillna(pqdf["FinAid"].value_counts().idxmax())


### Parent's highest degree achieved vs. GPA/College Satisfaction

In [5]:
# highest degree recieved by either parent vs. student's gpa
fig5 = plx.density_heatmap(pqdf, x=pqdf.ParentEdu, y=pqdf.GPA, text_auto=True, 
    category_orders={"ParentEdu":['Middle school','High school','Associates','Bachelors', 'Masters','Phd'],
                "GPA":['3.5 - 4.0','3.0 - 3.49','2.5 - 2.99','2.0 - 2.49','1.5 - 1.99','0 - 1.49']},
                color_continuous_scale="Viridis",width=1000,title="Parent Educational Attainment vs. Student GPA")
fig5.show()

This visualization aims to provide insight into the relationship between the highest degree recieved by either parent and the student's cumulative GPA, if any exists. From this heatmap, we can see that the majority of students have parents that have achieved either a bachelor's or a master's degree. At first glance, it may seem that if a student's parent achieved a higher degree, that the student would have a higher GPA. However, this may be misleading, as we see that regardless of the degree achieved by the parents, that there are more students with higher GPAs as the GPA range increased. 

The only conclusion that we can concretely draw from this graph is that there are more students with parents that have pursued higher levels of education than those that have not. The sample size of students who's parents did not pursue higher education is significantly smaller, thus making it difficult to draw any conclusions between a student's parents' highest level of education and the student's GPA.

In [6]:
# horizontal scaled bar graph, 
level_list = ['Unsatisfied','Slightly Unsatisfied','Neutral','Slightly Satisfied','Satisfied']
edu_list = ['Middle school','High school','Associates','Bachelors', 'Masters','Phd']
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)',
          'rgba(122, 120, 168, 0.8)', 'rgba(164, 163, 204, 0.85)',
          'rgba(190, 192, 213, 1)']

pcdf = pd.crosstab(pqdf.CurrentFeelCol,pqdf.ParentEdu,normalize='columns')
pcdf = pcdf.reindex(edu_list,axis="columns")
x_data=pcdf.values.tolist()
fig6 = go.Figure()
for happyindex in range(0,len(x_data)):
    fig6.add_traces([
        go.Bar(
            y=edu_list,x=x_data[happyindex],
            orientation='h', 
            name=level_list[happyindex],
            text=level_list[happyindex],
            marker=dict(color=colors[happyindex])
        )
    ])
fig6.update_layout(barmode='stack',width=1000,title="Parent Educational Attainment vs. Satisfaction Levels, Scaled")
fig6.show()

This scaled horizontal bar chart demonstrates the college satisfaction levels of a student, given the highest degree that either of the student's parents had achieved. From this graph, we can see that at either end of the spectrum (middle school or PHD), none of the students reported that they were extremely satisfied or unsatisfied. 

One interesting observation that the highest proportion of students that considered themselves extremely satisfied with their college experience had parents who achieved an associates degree. As the student's parents pursued higher levels of education, their levels of extreme satisfaction decreases. However, regardless of the student's parent's education, the proportion of students that reported to be mostly satisfied with their college experience seems to remain constant despite their parents' highest level of education.

### Annual Income vs. GPA/College Satisfaction

In [7]:
# Handling NA, replacing with mode
pqdf.AnnualIncome.fillna(pqdf.AnnualIncome.value_counts().idxmax(),inplace=True)

# Mapping Annual Income to group numbers
pqdf.AnnualIncome = pd.Categorical(pqdf.AnnualIncome)
pqdf['AnnualIncome_Category'] = pqdf.AnnualIncome.cat.codes

In [8]:
# TODO: is it possible to make these lines a little thicker?
fig2 = go.Figure(data=
    go.Parcoords(
        line = dict(color = pqdf["AnnualIncome_Category"]),
        dimensions = list([
            dict(range = [0,2], 
                constraintrange = [1,2],
                tickvals = [0,1,2],   
                label = 'Annual Income', values = pqdf.loc[:, "AnnualIncome_Category"],
                ticktext = ['0-50k', '50k-150k', '150k+']),
            dict(range = [0,5],
                tickvals = [0,1,2,3,4,5],
                label = 'GPA', values = pqdf.loc[:, "GPA_Category"],
                ticktext= ["0-1.49", "1.5-1.99", "2.0-2.49", "2.5-2.99", "3.0-3.49", "3.5-4.0"]),
            dict(range = [1,5],
                tickvals = [1,2,3,4,5],
                label = 'College Satisfaction', values = pqdf.loc[:,"CurrentFeelCol"],
                ticktext = ['Dissatisfied', "Slightly Dissatisfied", "Neutral", "Somewhat Satisfied", "Satisfied"]),
        ])
    )
)
fig2.update_layout(
    title = "Parallel Coordinate Mappings of Annual Income to GPA and College Satisfaction"
)
fig2.show()

In [9]:
# Create dimensions
annualIncome_dim = go.parcats.Dimension(values=pqdf.AnnualIncome, label="Annual Income")

gpa_dim = go.parcats.Dimension(values=pqdf.GPA, label="GPA")

satisfaction_dim = go.parcats.Dimension(
    values=pqdf.CurrentFeelCol, label="Satisfaction", categoryarray=[1,2,3,4,5],
    ticktext = ['Dissatisfied', "Slightly Dissatisfied", "Neutral", "Somewhat Satisfied", "Satisfied"]
)

# Create parcats trace
color = pqdf.CurrentFeelCol
fig = go.Figure(data = [go.Parcats(dimensions=[annualIncome_dim, gpa_dim, satisfaction_dim],
        line={'color': color},
        hoveron='category', hoverinfo='count+probability',
        labelfont={'size': 18, 'family': 'Times'},
        tickfont={'size': 16, 'family': 'Times'},
        arrangement='freeform')])

fig.update_layout(
    title = "In depth Parallel Coordinate Mappings of Annual Income to GPA and College Satisfaction",
    width=1000
)

fig.show()

Initially we just wanted to see if there was an impact on family annual income on gpa and from gpa to satisfaction. So we decided to use a parallel_coordinate graphs to see if there was a particular category that showed a specific pattern. The basic parallel coordinate graph showed an indication of such patterns specifically that people with 150k+ annual income had relativley high gpas and had a somewhat positive college experience. On the other hand, families with annual income of 50k-150k seemed to vary all over the place when it came to gpa and college experience. Families with lower than 50k annual income had high gpas but had varied college experiences. Looking at the data I decided to create a better graph that was more interactive and specific.

The more specific graph confirmed the findings for families with annual income of 150k+. For families with annual incomes of 50-150k it appears that most students had high gpas but were neutral and somewhat satisfied for the most part although there were some students that seem dissatisfied with their college experience. This is also rather normal and in lin with the initial graph. The truly surprising data was for students whose family's annual income is less than 50k. Most of the students have a relativley/positive gpa but when it came to college satisfaction something seemed up. It appears that the second most said satisfaction level, right after neutral, is slight disatisfaction with the college experience.

With all of this data in mind it appears that income may not have an effect on gpa, which is not what we expected as we thought students with more money would have access to more higher quality resources and thus a higher ration of above 3.0 or so gpa. As for the college experience category we expected that there wouldn't really be a relationship between income and satisfaction. In fact we were hoping there was a correlation between GPA and Satisfaction but this data not only shows that there is really no correlation but also that there may be a correlation between annual income and satisfaction. Specifically the lower your income the less satisfied you are with your college experience

### Age vs College Satisfaction/GPA

In [10]:
AvsCs= plx.density_heatmap(pqdf, x=pqdf.Age, y=pqdf.CurrentFeelCollege, text_auto=True, 
    category_orders={"Age":['18','19','20','21', '22','23','24','25','27','30','40'],
                "CurrentFeelCollege":['Satisfied','Somewhat Satisfied','Neutral','Slightly Dissatisfied','Dissatisfied']},
                color_continuous_scale="redor",width=1000,title="Overall College Satisfaction vs. Age")
AvsCs.show()

By comparing the data between how students are currently feeling in college versus age using a heat map I can see that there is a concentrated portion of data in the age range of 19 to 21. Looking at those three age ranges I found that most of the student in those age ranges would consider that they have a great college experiences. In this chart there are some outlier like age 18, 30, and 40. Combine these two only have 4 input and it does not do much to influence the overall concentrated portion of the graph. After age range 19 to 21 the input started to spread out more and this is likely because most college student are around age 18-21 and 18 years old usually would not be able to take this uppder division class in the fall. From my observation I concluded that age do have an influence on how you would feel in college because most people in the age range 19 to 21 would choose great.

In [11]:
AvsGPA= plx.density_heatmap(pqdf, x=pqdf.Age, y=pqdf.GPA, text_auto=True, 
    category_orders={"Age":['18','19','20','21', '22','23','24','25','27','30','40'],
                "GPA":['3.5 - 4.0','3.0 - 3.49','2.5 - 2.99','2.0 - 2.49','1.5 - 1.99','0 - 1.49']},
                color_continuous_scale="teal",width=1000,title="GPA vs. Age")
AvsGPA.show()

This graph shown that most of our data are concentrated in between age 19 to 21 and this is because most people who go to college are between age 19 to 21. By looking at the darker colar on the graph I see that there are many students answered that their GPA range is between 3.5 and 4.0. Beside age range from 19 to 21 other ages would still have a mode input of between 3.0 to 4.0 GPA. With this I can conclude that the student age do not have an effect on the student GPA and that most student have a high GPA because this is the CS major.

### Gender vs. College Satisfaction

In [12]:
a = pqdf.groupby(['Gender']).size()
b = pqdf.groupby(['Gender','Age']).size()
c = pqdf.groupby(['Gender','Age','CurrentFeelCol']).size()
d = c['Male']['21']
f = c['Female']['21']
fig = go.Figure()
fig.add_trace(go.Funnel(
    name = 'Male',
    y = ["Gender", "21 years old", "College Experience 4/5"],
    x = [a['Male'], b['Male']['21'], d[4]],
    textposition = "inside",
    textinfo = "value+percent initial"
    ))
fig.add_trace(go.Funnel(
    name = 'Female',
    y = ["Gender", "21 years old", "College Experience 4/5"],
    x = [a['Female'], b['Female']['21'], f[4]],
    textposition = "inside",
    textinfo = "value+percent initial"
    ))
fig.show()

From this funnel chart I am trying to find the students who are 21 years old and have a 4/5 college experience. The reason that I have decided to use gender, age 21, and College experience to find out of the data show a good gender representation is that I found that most student in this class are 21 years old and most of the student who are 21 answered that they have a 4/5 college experience. Using this funnel chart I found that out of 103 student not counting the non-binary/non-conforming there are 77 male and 26 female. There are 25 male and 7 female who are 21 years old. There are 13 male who have a 4/5 college experience and 3 female that have a 4/5 college experience. Using the funnel graph to show I took the data of male and female who are 21 years old and have a 4/5 college experience I can see that even though there are more male than female they have almost the same percent of input for being 21 years old and have 4/5 college experience. With this I conclude that our data have a good gender representation.

### Lectures Attended VS Financial Aid

In [13]:
lecvsaid["NormalizedAttendance"] = lecvsaid["LecAttended"]/lecvsaid["LecRequired"]
lecvsaid.loc[lecvsaid["NormalizedAttendance"] > 1, "NormalizedAttendance"] = 1
fig14 = plx.box(lecvsaid, x="FinAid", y="NormalizedAttendance", points = "all", labels={"FinAid":"Financial Aid Given", "NormalizedAttendance":"Normalized Attendance"}, title="Financial Aid vs Class Attendance(Normalized)", color="FinAid")
fig14.show()


So if we take a look at the diagram there seems to be just a slight correlation. It appears that people with financial aid attend their classes on a lower basis per week. We can see this by the fact that the lower fence and upper fence of people with financial aid is lower than the people without financial aid. So despite the median being the same for the normalized attendance, there appears to be more people under the median that skip classes more, or a few people that skip a lot of classes for those who have financial aid over those that don't. This could be for a variey of reasons such as having a small sample size, but I think the reason for how the graphs look is because people that don't have financial aid are probably more motivated to get their money's worth.

### Generational Status vs. GPA

In [14]:
imGPA = pd.crosstab(pqdf['GPA'], pqdf["GenImmigrant_Category"])
imGPA = imGPA.pipe(lambda x: x.div(x.sum(), axis = 'columns'))
imGPA.rename(columns = {0: "1st Generation", 1: "2nd Generation", 2: "3rd Generation or more", 3: "Native"}, inplace=True)
imGPA = imGPA * 100

In [15]:
imGPA = pd.crosstab(pqdf['GPA'], pqdf["GenImmigrant_Category"])
imGPA = imGPA.pipe(lambda x: x.div(x.sum(), axis = 'columns'))
imGPA.rename(columns = {0: "1st Generation", 1: "2nd Generation", 2: "3rd Generation or more", 3: "Native"}, inplace=True)
fig10 = plx.scatter(imGPA, size = 'value', 
    labels={
        "GPA" : "GPA Ranges",
        "GenImmigrant_Category" : "Immigrant Generational Groupings",
        "value" : "Proportion of Each Immigrant Category"
    },
title = "GPA Distribution for Each Immigrant Group")
fig10.show()

In [16]:
imIncome = pd.crosstab(pqdf['AnnualIncome'], pqdf["GenImmigrant_Category"])
imIncome = imIncome.pipe(lambda x: x.div(x.sum(), axis = 'columns'))
imIncome = imIncome.reindex(['Less than 50k', '50k - 150k', 'More than 150k'])
imIncome.rename(columns = {0: "1st Generation", 1: "2nd Generation", 2: "3rd Generation or more", 3: "Native"}, inplace=True)
fig11 = plx.scatter(imIncome, size = 'value', 
    labels={
        "AnnualIncome" : "Annual Income Ranges",
        "GenImmigrant_Category" : "Immigrant Generational Groupings",
        "value" : "Proportion of Each Immigrant Category"
    },
title = "Annual Family Income Distribution for Each Immigrant Group")
fig11.show()


When looking at the Bubble Chart "GPA Distribution for Each Immigrant Group", it's noticable that 3rd Generation or more immigrants have a greater proportion of high GPA's compared to other generations of immigrants. To understand the reasoning for this, take a look at the "Annual Family Income Distribution for Each Immigrant Group" chart. From the chart, 3rd or more Generation immigrants have higher incomes of either 50k-100k or more than 150k, which implies that Annual Income is a confounding variable for GPA performance. However, it's worth nothing that 1st Generation immigrants are the second highest performing group in terms of GPA, despite very few of them having a family income of over 150k.




### Generational Status vs. GPA

In [21]:
aidGPA = pd.crosstab(pqdf.GPA, pqdf.FinAid)
fig12 = plx.density_heatmap(pqdf, x = pqdf.FinAid, y=pqdf.GPA, text_auto=True, 
    category_orders={"FinAid":['No, Yes'],
                "GPA":['3.5 - 4.0','3.0 - 3.49','2.5 - 2.99','2.0 - 2.49','1.5 - 1.99','0 - 1.49']},
                color_continuous_scale="Plasma",width=1000)
fig12.show()

In [22]:
aidGPA = pd.crosstab(pqdf.GPA, pqdf.FinAid, normalize="columns")
fig13 = plx.imshow(aidGPA, origin = "lower")
fig13.show()

From the analysis, we can see that students who receieve Financial Aid have a greater number of people with higher GPA's than those who receive no Financial Aid. However, when adjusting for the fact that those who receive financial aid constiute a higher percentage of students as a whole, we can find out the GPA's as a percentage of a student's financial aid status. This reveals that students who receive no Financial Aid have a more concentrated number of students at the higher GPA's than those students who do receive financial aid.

## Hypotheses and Testing

We would like to test the following hypotheses:

* There exists a relationship between a student's household income and their GPA.
* There exists a correlation between a student's household income and their satisfaction with the college experience.
* There exists a relationship between a student's parents' highest degree achieved, and the student's GPA.
* There exists a relationship between a student's immigrant generational status, and their own GPA

### Annual Income vs. GPA

To test the relationship between household annual income and GPA, we utilize the Chi-Squared test to unveil any statistical significance between these two categorical variables.  

In [17]:
#chi squared
annualtogpa = pd.crosstab(pqdf["AnnualIncome"], pqdf["GPA"])
c, p, dof, expctd = chi2_contingency(annualtogpa)
p, dof

(0.2597755179205747, 8)

* Null/Alternative Hypothesis: TODO
* Alpha Value: 
* Test/Analysis:
* Conclusion: (reject/accept)

### Manual Verification (Annual Income vs. GPA)

For our toy dataset toy_data.csv, we took 8 elements to perform chi-squared and correlation analysis on by hand to test the code we wrote to perform the analyses.

Note: Because we only sampled n = 8 elements for our toy data set, we are likely to not have the most representative dataset, so the "conclusion" we draw from manually analyzing the toy dataset will likely be different from that of our final one. However, the calculation method should be the same and our code would produce the same output.

#### χ2 (Chi-Squared) 

To test our χ2 analysis, I will be using the toy dataset to perform our analysis on Family Income -> GPA by hand. Here is our data...

|            | 1.0 - 1.49 | 1.5 - 2.49 | 2.5 - 2.99 | 3.0 - 3.49  | 3.5 - 4.0 | Total |
|------------|------------|------------|------------|-------------|-----------|-------|
| < 50k      | 0          | 0          | 1          | 1           | 2         | 4     |
| 50k - 150k | 0          | 0          | 0          | 2           | 0         | 2     |
| > 150k     | 0          | 0          | 0          | 2           | 0         | 2     |
| Total      | 0          | 0          | 1          | 5           | 2         |       |

Here's our calculations for χ2

$$\chi^{2}=\sum \frac{\left({O}_{i}-E_{i}\right)^{2}}{E_{i}}$$
$$ {e}_{(2.5, <50k)} = 1 * 4 / 8 = 0.5$$
$$ {e}_{(3.0, <50k)} = 5 * 4 / 8 = 20 / 8 = 2.5 $$
$$ {e}_{(3.5, <50k)} = 2 * 4 / 8 = 8 / 8 = 1 $$
$$ {e}_{(2.5, 50-150k)} = 1 * 2 / 8 = 2/8 = 0.25 $$
$$ {e}_{(3.0, 50-150k)} = 5 * 2 / 8 = 10 / 8 = 1.25 $$
$$ {e}_{(3.5, 50-150k)} = 2 * 2 / 8 = 4/8 = 0.5 $$
$$ {e}_{(2.5, 150k+)} = 1 * 2 / 8 = 2/8 = 0.25 $$
$$ {e}_{(3.0, 150k+)} = 5 * 2 / 8 = 10 / 8 = 1.25 $$
$$ {e}_{(3.5, 150k+)} = 2 * 2 / 8 = 4/8 = 0.5 $$
$$ \chi^{2} = \frac{(1 - 0.5)^2}{0.5}  + \frac{(1 - 2.5)^2}{2.5} + \frac{(2 - 1)^2}{1} + \frac{(0 - 0.25)^2}{0.25} + \frac{(2 - 1.25)^2}{1.25} + \frac{(0 - 0.5)^2}{0.5} + \frac{(0 - 0.25)^2}{0.25} + \frac{(2 - 1.25)^2}{1.25} + \frac{(0 - 0.5)^2}{0.5} $$
$$ \chi^{2} = 4.8$$

### Annual Income vs. College Experience Satisfaction

We use both the Chi-Square and correlation tests to observe whether or not there exists any relationship between a student's annual household income and the student's satisfaction with their college experience.

In [18]:
# Chi-Square
annualtofeel = pd.crosstab(pqdf["AnnualIncome"], pqdf["CurrentFeelCol"])
c, p, dof, expctd = chi2_contingency(annualtofeel)
p, dof

(0.08695852315959544, 8)

In [19]:
# Correlation for annual income vs college experience as chi squared null hypothesis has been rejected
annualfeelrho = np.corrcoef(pqdf["AnnualIncome_Category"], pqdf["CurrentFeelCol"])
annualfeelrho

array([[1.        , 0.06743329],
       [0.06743329, 1.        ]])

Null Hypothesis:
There is no correlation between annual income and GPA
There is no coorelation between annual income and Satisfaction with your college experience

Now lets do some chi-squared analysis to determine if there is a correlation between annual income and GPA. For annual income vs college experience there is a p value of about .34 meaning if we reject the null hypothesis there is an 34% chance there is no correlation. As this is too large of a number we cannot successfully reject the null hypothesis for annual income vs GPA

Now lets do some chi-squared analysis to determine if there is a correlation between annual income and Satisfaction with current college experience. For annual income vs college experience there is a p value of about .08 meaning if we reject the null hypothesis, there is an 8
 chance that there is no correlation. This number is greater than our passing value of .05 so we can't reject our null hypothesis.

Despite not passing the Chi-squared test, our P-value was rather small, so we decided to run correlation regardless. I found that there was a slight positive correlation of .07. It's practically negligable but this is due to the fact we have categorical data more than numerical data but there is still appears to be a positive correlation nonetheless.

### Manual Verification (Annual Income vs. Satisfaction)

Once more, we'll perform a manual verification on the toy dataset. The counts for Family Annual Income vs College Satisfaction on the toy dataset are as follows...

|            | 1 (Very unsatisfied) | 2 (Unsatisfied) | 3 (Neutral) | 4 (Satisfied) | 5 (Very Satisfied) | Total |
|------------|----------------------|-----------------|-------------|---------------|--------------------|-------|
| < 50k      | 0                    | 0               | 2           | 2             | 0                  | 4     |
| 50k - 150k | 0                    | 0               | 1           | 1             | 0                  | 2     |
| > 150k     | 0                    | 0               | 1           | 1             | 0                  | 2     |
| Total      | 0                    | 0               | 4           | 4             | 0                  | 8     |

Here's our calculations for χ2

$$\chi^{2}=\sum \frac{\left({O}_{i}-E_{i}\right)^{2}}{E_{i}}$$
$$ {E}_{(Neutral, <50k)} = 4 * 4 / 8 = 16 / 8 = 2$$
$$ {E}_{(Satisfied, <50k)} = 4 * 4 / 8 = ... = 2 $$
$$ {E}_{(Neutral, 50-150k)} = 4 * 2 / 8 = 1 $$
$$ {E}_{(Satisifed, 50-150k)} = 4 * 2 / 8 = 1 $$
$$ {E}_{(Neutral, 150k+)} = 4 * 2 / 8 = 1 $$
$$ {E}_{(Satisfied, 150k+)} = 4 * 2 / 8 = 1 $$
$$ \chi^{2} = \frac{(2 - 2)^2}{2} + \frac{(2 - 2)^2}{2} + \frac{(1 - 1)^2}{1} + \frac{(1 - 1)^2}{1} + \frac{(1 - 1)^2}{1} + \frac{(1 - 1)^2}{1}$$
$$ \chi^{2} = 0 $$

And for the correlation test...

We treat the 3 income groups as bins, represented with an integer value. The <50k group is treated as a value of 1, 50k - 150k as 2, and >150k as 3. Satisfaction is already fine, since it's being represented as an integer from 1-5.

From our toy dataset, here is our set of points for our n=8 samples, using the above mapping...
| Satisfaction | Income |
|--------------|--------|
| 3            | 2      |
| 4            | 2      |
| 3            | 1      |
| 4            | 1      |
| 4            | 1      |
| 3            | 1      |
| 3            | 3      |
| 4            | 3      |

Calculations
$$r_{xy} = \frac{\sum{(x_i - x_{avg})(y_i - y_{avg})}}{\sqrt{\sum{(x_i - x_{avg})^2 \sum(y_i - y_{avg})^2}}}$$
$$x_{avg} = (3 + 4 + 3 + 4 + 4 + 3 + 3 +4) / 8 = 3.5$$
$$y_{avg} = (2 + 2 + 1 + 1 + 1 + 1 + 3 + 3) / 8 =  1.75$$

$$\sum{(x_i - x_{avg})(y_i - y_{avg})} = (3 - 3.5)(2 - 1.75) + (4 - 3.5)(2 - 1.75) + (3 - 3.5)(1 - 1.75) + (4 - 3.5)(1 - 1.75) + (4 - 3.5)(1 - 1.75) + (3 - 3.5)(1 - 1.75) + (3 - 3.5)(3 - 1.75) + (4 - 3.5)(3 - 1.75) = 0$$

Because the numerator is 0, in this case our manually calculated correlation value is 0 (so it doesn't matter what the denominator is, but if I were to expand it out I would have solved for those two sums).

The correlation value of zero as well as the chi-squared value of 0 from the manual analysis of our toy data set is likely coming from the fact that the toy dataset has a very small size (n = 8); Our automated analysis showed a non-zero correlation value and low chi-squared, however those values were both low enough that it is more likely that the null hypothesis of there being no connection is true.

### Parent's Educational Attainment vs. Student's GPA

To test if there exists a significant relationship between a student's parents' educational attainment and the student's GPA, we proceed by using the Chi-Square method of testing.

#### Null Hypothesis

There is no statistically significant relationship between a student's parents' highest degree obtained with respect to the student's GPA in college.

#### Alpha Value

We choose our alpha value to be 0.05, which indicates the maximum probability of rejecting the null hypothesis. With 20 degrees of freedom, we need a Chi-Square statistic that exceeds 31.410.

In [20]:
# chi squared test for highest parent education achieved vs. gpa
edutogpa = pd.crosstab(pqdf["ParentEdu"], pqdf["GPA"])
c, p, dof, expctd = chi2_contingency(edutogpa)
c, p, dof

(29.16962991403458, 0.08447264527198481, 20)

#### Analysis + Conclusion

The test reveals that our Chi-Square statistic 29.169 is less than 31.410, which is the Chi-Square statistic that we needed to exceed. Therefore, we cannot reject the null hypothesis, and we conclude that there is no relationship between the highest degree obtained by either parent and the student's GPA.

### Manual Verification (Parents' Education vs. GPA)

Once more, we'll perform a manual verification on the toy dataset. The counts for Parents' Education vs Student GPA on the toy dataset are as follows...

|             | 2.5 - 2.99 | 3.0 - 3.49 | 3.5 - 4.0 | Total |
|-------------|------------|------------|-----------|-------|
| High school | 1          | 1          | 1         | 3     |
| Bachelors   | 0          | 1          | 1         | 2     |
| Masters     | 0          | 2          | 0         | 2     |
| PhD         | 0          | 1          | 0         | 1     |
| Total       | 1          | 5          | 2         | 8     |

Here's our calculations for χ2

$$\chi^{2}=\sum \frac{\left({O}_{i}-E_{i}\right)^{2}}{E_{i}}$$
$$ {E}_{(High school, 2.5-2.99)} = 3 * 1 / 8 = 3 / 8 = 0.375 $$
$$ {E}_{(High school, 3.0-3.49)} = 3 * 5 / 8 = 15 / 8 = 1.875 $$
$$ {E}_{(High school, 3.5-4.0)} = 3 * 2 / 8 = 6/8 = 0.75 $$
$$ {E}_{(Bachelors, 3.0-3.49)} = 2 * 5 / 8 = 10/8 = 1.25 $$
$$ {E}_{(Bachelors, 3.5-4.0)} = 2 * 2 / 8 = 4/8 = 0.5 $$
$$ {E}_{(Masters, 3.0-3.49)} = 2 * 5 / 8 = 10/8 = 1.25 $$
$$ {E}_{(PhD, 3.0-3.49)} = 1 * 5 / 8 = 5/8 = 0.625 $$

$$ \chi^{2} = \frac{(1 - 0.375)^2}{0.375} + \frac{(1 - 1.875)^2}{1.875} + \frac{(1 - 0.75)^2}{0.75} + \frac{(1 - 1.25)^2}{1.25} + \frac{(1 - 0.5)^2}{0.5} + \frac{(2 - 1.25)^2}{1.25} + \frac{(1 - 0.625)^2}{0.625} $$
$$ \chi^{2} = 2.7583 $$

Just like in our automated calculation, our chi-squared value does not exceed the threshold needed to be able to reject the null hypothesis.