## Mini Project 1

### Description

This project examines the relationship between students' socio-economic circumstances and their overall satisfaction with the college experience. Exploring these relationships lets us assess whether such circumstances have a measurable association with college satisfaction.

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np
import plotly.express as plx
import plotly.graph_objects as go
import seaborn as sns

df = pd.read_csv('data.csv')

# potential questions data frame, cleaning up column names
pqdf = df.iloc[:, [2, 3, 7, 14, 17, 23, 31, 32, 37, 38, 50, 51, 52, 80]]
pqdf = pqdf.rename(columns={
    pqdf.columns[0]: 'Age',
    pqdf.columns[1]: 'Gender',
    pqdf.columns[2]: 'GenImmigrant',
    pqdf.columns[3]: 'CollegeYear',
    pqdf.columns[4]: 'GPA',
    pqdf.columns[5]: 'CurrentFeelCol',
    pqdf.columns[6]: 'GenStudent',
    pqdf.columns[7]: 'ParentEdu',
    pqdf.columns[8]: 'LecRequired',
    pqdf.columns[9]: 'LecAttended',
    pqdf.columns[10]: 'FinAid',
    pqdf.columns[11]: 'StudentLoans',
    pqdf.columns[12]: 'AnnualIncome',
    pqdf.columns[13]: 'CurrentFeelQtr',
})
pqdf.head()

Unnamed: 0,Age,Gender,GenImmigrant,CollegeYear,GPA,CurrentFeelCol,GenStudent,ParentEdu,LecRequired,LecAttended,FinAid,StudentLoans,AnnualIncome,CurrentFeelQtr
0,22,Male,1st generation,Senior,3.0 - 3.49,3.0,1st generation (first to attend),Masters,8.0,4,Yes,No,50k - 150k,1.0
1,22,Female,1st generation,Junior,3.0 - 3.49,4.0,2nd generation (parents attended college),PhD,9.0,9,Yes,No,50k - 150k,2.0
2,27,Male,3rd generation or more,Senior,3.5 - 4.0,3.0,2nd generation (parents attended college),Bachelors,8.0,7,Yes,No,Less than 50k,4.0
3,24,Male,1st generation,Senior,3.5 - 4.0,3.0,1st generation (first to attend),High school,11.0,8,Yes,Yes,Less than 50k,3.0
4,40,Female,1st generation,Junior,0 - 1.49,3.0,1st generation (first to attend),High school,6.0,3,Yes,Yes,Less than 50k,1.0


### Questions 1 and 2
<!-- To what extent do socio-economic circumstances affect one's academic performance and personal satisfaction with their college experience?
1. Gather the questions that relate to socio-economic circumstances
2. Gather the questions that relate to college experience/happiness
3. Map the questions from part 1 to part 2 -->

#### Data

The data we are interested in captures the general socio-economic background of the surveyed population. The data collected about each survey participant can be grouped as follows:

* Socio-Economic Background
    * Age
    * Gender
    * Financial Aid amount
    * Student Loans
    * Immigrant generation
    * College generation
    * Highest level of education held by parents
    * Household annual income
* College Satisfaction/Performance
    * Cumulative GPA
    * College experience satisfaction levels
    * Feelings about current quarter
    * Stress level
    * Lectures per week
    * Lectures attended

#### Questions


Using this data set, we begin analysis on the following set of questions:

* Does annual household income have an effect on GPA or college satisfaction?
* Can the highest level of education achieved by one's parents predict GPA or college satisfaction?
* Do aspects of the average UCR CS student's socio-economic status, such as age and college generation, affect their college satisfaction?

### Cleaning Dataset

In [2]:
# Handle missing values by replacing them with the most frequent value (the mode).
# Assigning the result back avoids relying on inplace=True through attribute access.
pqdf["GPA"] = pqdf["GPA"].fillna(pqdf["GPA"].value_counts().idxmax())
pqdf["CurrentFeelCol"] = pqdf["CurrentFeelCol"].fillna(pqdf["CurrentFeelCol"].value_counts().idxmax())
pqdf["CurrentFeelCol"] = pqdf["CurrentFeelCol"].astype(int)

# Map GPA ranges to ordered category codes (0 = lowest range, 5 = highest range)
pqdf["GPA"] = pd.Categorical(pqdf["GPA"], ordered=True, categories=['0 - 1.49', '1.5 - 1.99', '2.0 - 2.49', '2.5 - 2.99', '3.0 - 3.49', '3.5 - 4.0'])
pqdf['GPA_Category'] = pqdf["GPA"].cat.codes

In [3]:
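# Normalize capitalization (e.g. "PhD" -> "Phd") so identical answers share one label
pqdf["ParentEdu"] = pqdf["ParentEdu"].str.capitalize()
pqdf["ParentEdu"] = pqdf["ParentEdu"].fillna(pqdf["ParentEdu"].value_counts().idxmax())
# Categories are unordered here, so the codes follow alphabetical order, not education level
pqdf["ParentEdu"] = pd.Categorical(pqdf["ParentEdu"])
pqdf['ParentEdu_Category'] = pqdf["ParentEdu"].cat.codes
print(pqdf.ParentEdu, pqdf.ParentEdu_Category.min(), pqdf.ParentEdu_Category.max())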


0         Masters
1             Phd
2       Bachelors
3     High school
4     High school
         ...     
90    High school
91      Bachelors
92            Phd
93        Masters
94      Bachelors
Name: ParentEdu, Length: 95, dtype: category
Categories (6, object): ['Associates', 'Bachelors', 'High school', 'Masters', 'Middle school', 'Phd'] 0 5


### Parents' Highest Degree Achieved vs. GPA/College Satisfaction

In [4]:
# Heatmap of parents' highest degree vs. GPA range, with the count shown in each cell
fig5 = plx.density_heatmap(
    pqdf, x="ParentEdu", y="GPA", text_auto=True,
    category_orders={"ParentEdu": ['Middle school', 'High school', 'Associates', 'Bachelors', 'Masters', 'Phd'],
                     "GPA": ['3.5 - 4.0', '3.0 - 3.49', '2.5 - 2.99', '2.0 - 2.49', '1.5 - 1.99', '0 - 1.49']},
    color_continuous_scale="Viridis",
)
fig5.show()

In [5]:
# Chi-squared test of independence: parents' education vs. GPA range
edutogpa = pd.crosstab(pqdf["ParentEdu"], pqdf["GPA"])
c, p, dof, expctd = chi2_contingency(edutogpa)
p, dof

(0.03589486566241797, 20)

In [21]:
# Satisfaction labels for CurrentFeelCol values 1-5
level_list = ['Dissatisfied', 'Slightly Dissatisfied', 'Neutral', 'Somewhat Satisfied', 'Satisfied']
edu_list = ['Middle school','High school','Associates','Bachelors', 'Masters','Phd']
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)',
          'rgba(122, 120, 168, 0.8)', 'rgba(164, 163, 204, 0.85)',
          'rgba(190, 192, 213, 1)']

# Share of each satisfaction level within each parent-education group (columns sum to 1)
pcdf = pd.crosstab(pqdf.CurrentFeelCol, pqdf.ParentEdu, normalize='columns')
pcdf = pcdf.reindex(edu_list, axis="columns")
x_data = pcdf.values.tolist()

# One stacked horizontal bar segment per satisfaction level
fig6 = go.Figure()
for happyindex, shares in enumerate(x_data):
    fig6.add_trace(
        go.Bar(
            y=edu_list, x=shares,
            orientation='h',
            name=level_list[happyindex],
            text=level_list[happyindex],
            marker_color=colors[happyindex],
        )
    )
fig6.update_layout(barmode='stack')
fig6.show()

CurrentFeelCol,Middle school,High school,Associates,Bachelors,Masters,Phd
1,0.0,0.0,0.0,0.0,0.076923,0.0
2,0.5,0.15,0.333333,0.090909,0.038462,0.0
3,0.25,0.25,0.0,0.393939,0.346154,0.555556
4,0.25,0.5,0.333333,0.454545,0.5,0.444444
5,0.0,0.1,0.333333,0.060606,0.038462,0.0


In [7]:
pcdf = pd.crosstab(pqdf.ParentEdu,pqdf.CurrentFeelCol, normalize = 'columns')
#pcdf = pcdf/pcdf[:].sum(axis=1)
#print(pcdf.sum(axis=1))
#pcdf
# fig7 = plx.bar(pcdf, orientation='h', color_continuous_scale="Viridis", text_auto=True)
# fig7.show()

### Annual Income vs. GPA/College Satisfaction

In [8]:
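# Handle missing values by replacing them with the most frequent value (the mode)
pqdf["AnnualIncome"] = pqdf["AnnualIncome"].fillna(pqdf["AnnualIncome"].value_counts().idxmax())

# Map annual income to category codes. Without explicit categories the codes follow
# alphabetical label order: 0 = '50k - 150k', 1 = 'Less than 50k', 2 = 'More than 150k'
pqdf["AnnualIncome"] = pd.Categorical(pqdf["AnnualIncome"])
pqdf['AnnualIncome_Category'] = pqdf["AnnualIncome"].cat.codes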
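
Because `pd.Categorical` sorts labels lexicographically when no `categories` argument is given, the income codes above do not run from lowest to highest income. The following is a minimal sketch, using only the three income labels that appear in this survey, of how the default codes compare with codes from an explicitly ordered categorical. The variable names are chosen for illustration and are not part of the notebook.

```python
import pandas as pd

# The three income labels used in the survey, in an arbitrary sample order
income = pd.Series(["Less than 50k", "50k - 150k", "More than 150k", "50k - 150k"])

# Default behaviour: categories are sorted alphabetically
default_cat = pd.Categorical(income)
print(list(default_cat.categories))
print(list(default_cat.codes))

# Explicit, ordered categories: codes now increase with income
ordered_cat = pd.Categorical(
    income,
    categories=["Less than 50k", "50k - 150k", "More than 150k"],
    ordered=True,
)
print(list(ordered_cat.codes))
```

Any tick labels attached to numeric codes need to follow whichever ordering the codes actually use.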

In [9]:
fig2 = go.Figure(data=
    go.Parcoords(
        line = dict(color = pqdf["AnnualIncome_Category"]),
        dimensions = list([
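            # AnnualIncome_Category codes follow alphabetical label order:
            # 0 = '50k - 150k', 1 = 'Less than 50k', 2 = 'More than 150k'
            dict(range = [0,2], 
                constraintrange = [1,2],
                tickvals = [0,1,2],   
                label = 'Annual Income', values = pqdf.loc[:, "AnnualIncome_Category"],
                ticktext = ['50k-150k', '0-50k', '150k+']),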
            dict(range = [0,5],
                tickvals = [0,1,2,3,4,5],
                label = 'GPA', values = pqdf.loc[:, "GPA_Category"],
                ticktext= ["0-1.49", "1.5-1.99", "2.0-2.49", "2.5-2.99", "3.0-3.49", "3.5-4.0"]),
            dict(range = [1,5],
                tickvals = [1,2,3,4,5],
                label = 'College Satisfaction', values = pqdf.loc[:,"CurrentFeelCol"],
                ticktext = ['Dissatisfied', "Slightly Dissatisfied", "Neutral", "Somewhat Satisfied", "Satisfied"]),
        ])
    )
)
fig2.show()

In [10]:
# Create dimensions
annualIncome_dim = go.parcats.Dimension(values=pqdf.AnnualIncome, label="Annual Income")

gpa_dim = go.parcats.Dimension(values=pqdf.GPA, label="GPA")

satisfaction_dim = go.parcats.Dimension(
    values=pqdf.CurrentFeelCol, label="Satisfaction", categoryarray=[1,2,3,4,5],
    ticktext = ['Dissatisfied', "Slightly Dissatisfied", "Neutral", "Somewhat Satisfied", "Satisfied"]
)

# Create parcats trace
color = pqdf.CurrentFeelCol
fig = go.Figure(data = [go.Parcats(dimensions=[annualIncome_dim, gpa_dim, satisfaction_dim],
        line={'color': color},
        hoveron='category', hoverinfo='count+probability',
        labelfont={'size': 18, 'family': 'Times'},
        tickfont={'size': 16, 'family': 'Times'},
        arrangement='freeform')])

fig.show()

In [11]:
# Chi-squared test of independence: annual income vs. GPA range
annualtogpa = pd.crosstab(pqdf["AnnualIncome"], pqdf["GPA"])
c, p, dof, expctd = chi2_contingency(annualtogpa)
p, dof

(0.338866571563568, 8)

In [12]:
annualtofeel = pd.crosstab(pqdf["AnnualIncome"], pqdf["CurrentFeelCol"])
c, p, dof, expctd = chi2_contingency(annualtofeel)
p, dof

(0.03227699973844558, 8)

Initially, we wanted to see whether family annual income had an impact on GPA, and whether GPA in turn related to satisfaction. We used a parallel coordinates plot to check whether any income category showed a distinct pattern. The basic plot suggested such patterns: students from families with an annual income above 150k had relatively high GPAs and a somewhat positive college experience. Students from families earning 50k-150k, on the other hand, varied widely in both GPA and college experience. Students from families earning less than 50k had high GPAs but varied college experiences. Based on this, we created a second plot (a parallel categories diagram) that is more interactive and more specific.

The more specific plot confirmed the findings for families with an annual income above 150k. For families earning 50k-150k, most students had high GPAs and were mostly neutral or somewhat satisfied, although some students appeared dissatisfied with their college experience. This is unremarkable and in line with the initial plot. The truly surprising result was for students whose family's annual income is less than 50k. Most of these students have a relatively high GPA, but their college satisfaction stood out: slight dissatisfaction was tied with neutral (7 students each) as the second most common satisfaction level, behind somewhat satisfied (9 students).

With all of this in mind, income does not appear to have an effect on GPA. This is not what we expected: we thought students with more money would have access to more high-quality resources and therefore a higher ratio of GPAs above roughly 3.0. For college experience, we expected no real relationship between income and satisfaction; instead, we were hoping to find a relationship between GPA and satisfaction. The plots suggest little relationship between GPA and satisfaction, but they do suggest a possible relationship between annual income and satisfaction: specifically, the lower the income, the less satisfied students are with their college experience.
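
The income and satisfaction pattern can be checked against the per-group counts that this notebook prints from `groupby(['AnnualIncome', 'CurrentFeelCol'])`. The following is a minimal sketch that assumes those counts are re-entered by hand; the names `counts`, `shares`, and `dissatisfied_share` are illustrative only.

```python
import pandas as pd

# Counts copied by hand from the notebook's groupby output
# (rows: income group, columns: satisfaction level 1-5)
counts = pd.DataFrame(
    [[1, 7, 7, 9, 3],
     [1, 2, 19, 19, 3],
     [0, 1, 7, 16, 0]],
    index=["Less than 50k", "50k - 150k", "More than 150k"],
    columns=[1, 2, 3, 4, 5],
)

# Share of each satisfaction level within each income group (each row sums to 1)
shares = counts.div(counts.sum(axis=1), axis=0)

# Share of students in each group who reported level 1 or 2
dissatisfied_share = shares[[1, 2]].sum(axis=1)
print(dissatisfied_share.round(3))
```

With these counts, roughly 30% of students in the lowest income group report level 1 or 2, compared with about 7% and 4% in the two higher income groups.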

Null hypotheses:

* There is no association between annual income and GPA.
* There is no association between annual income and satisfaction with the college experience.

First, we use a chi-squared test of independence to determine whether annual income and GPA are related. For annual income vs. GPA the p-value is about 0.34: if income and GPA were truly independent, we would see an association at least this strong in about 34% of samples. This is well above our significance threshold of 0.05, so we fail to reject the null hypothesis for annual income vs. GPA.

We also use a chi-squared test of independence to determine whether annual income and satisfaction with the current college experience are related. Here the p-value is about 0.03: if income and satisfaction were truly independent, an association at least this strong would arise in only about 3% of samples. This is below our significance threshold of 0.05, so we reject the null hypothesis and conclude that there is a statistically significant association between annual income and college satisfaction.
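
The chi-squared approximation is commonly considered less reliable when many expected cell counts are small (a frequent rule of thumb asks for at least 5 per cell). The sketch below, which assumes the income vs. satisfaction counts printed by the notebook's `groupby` cell are re-entered by hand, reruns the test and counts how many expected frequencies fall under that threshold. The names `observed` and `low_cells` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied by hand from the notebook's groupby output
# (columns: satisfaction level 1-5)
observed = np.array([
    [1, 2, 19, 19, 3],   # 50k - 150k
    [1, 7, 7, 9, 3],     # Less than 50k
    [0, 1, 7, 16, 0],    # More than 150k
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")

# Count the cells whose expected frequency is below 5
low_cells = int((expected < 5).sum())
print(f"{low_cells} of {expected.size} cells have an expected count below 5")
```

Here 9 of the 15 expected counts fall below 5, so the p-value of about 0.03 is best read with some caution given the sample of 95 students.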

In [13]:
# TODO: remove this testing section once testing is complete
print(pqdf.CurrentFeelCol.min(), pqdf.CurrentFeelCol.max())
print(type(pqdf.CurrentFeelCol[0]))
print(type(pqdf.GPA_Category[0]))

1 5
<class 'numpy.int32'>
<class 'numpy.int8'>


In [14]:
print(pqdf.CurrentFeelCol.value_counts())
print(pqdf.GPA.value_counts())
print(pqdf.AnnualIncome.value_counts())

4    44
3    33
2    10
5     6
1     2
Name: CurrentFeelCol, dtype: int64
3.5 - 4.0     54
3.0 - 3.49    29
2.5 - 2.99     8
0 - 1.49       2
2.0 - 2.49     2
1.5 - 1.99     0
Name: GPA, dtype: int64
50k - 150k        44
Less than 50k     27
More than 150k    24
Name: AnnualIncome, dtype: int64


In [15]:
#print(pqdf.groupby(['AnnualIncome', 'GPA']).size().reset_index(name = 'count'))
print(pqdf.groupby(['AnnualIncome', 'CurrentFeelCol']).size().reset_index(name = 'count'))


      AnnualIncome  CurrentFeelCol  count
0       50k - 150k               1      1
1       50k - 150k               2      2
2       50k - 150k               3     19
3       50k - 150k               4     19
4       50k - 150k               5      3
5    Less than 50k               1      1
6    Less than 50k               2      7
7    Less than 50k               3      7
8    Less than 50k               4      9
9    Less than 50k               5      3
10  More than 150k               1      0
11  More than 150k               2      1
12  More than 150k               3      7
13  More than 150k               4     16
14  More than 150k               5      0


In [17]:
# Scratch check: number of unique values in each column
for col in pqdf:
    print(pqdf[col].unique().size)

12
4
5
6
5
5
5
6
14
14
3
3
3
6
5
6
3
