# What's it like to be a 10th grader?
<hr>

## Outline
<ol>
<li>Introduction
    <ul>
    <li>1.1 Background</li>
    <li>1.2 Libraries Used</li>
    </ul>
</li>
<li>Data Collection
    <ul>
    <li>2.1 Data Source</li>
    <li>2.2 About the Data</li>
    <li>2.3 Using the Data in Python</li>
    </ul>
</li>
<li>Exploratory Data Analysis & Data Visualization
    <ul>
    <li>3.1 Total Hours Spent Doing Homework</li>
    <li>3.2 Data Processing</li>
    </ul>
</li>
<li>Machine Learning Analysis</li>
<li>Discussion of Findings</li>
<li>Citations</li>
</ol>

## 1: Introduction

Every chilly November morning, I had my first class of the day in a brick-walled classroom. This was my honors 10th grade English class, and most people in it were at least fairly literate. Even so, half of the class would be nodding off or spacing out when the teacher asked whether the gerund up on the projector was grammatical in the sentence, or why Gatsby was obsessed with the color green, or how to use MLA citations.
However, one day we got the teacher far enough off track that we were 30 minutes into class and she didn't feel like teaching anymore (she must have been tenured; that lady was not fully there mentally anyway), so she decided to go around the room and ask everyone where they saw themselves in five years. Mostly I got to hear my best friend talk about how she'll be in law school, and my friend from middle school discuss how he is already starting to work in his dad's plumbing business. I'm not paying much attention; it is 8am, after all.

But I definitely tune back in when I hear the quiet kid say:
"Probably in the back of some van."
That was a very brash response, but some people do want to live out of vans and travel the country. So as any decent person would, the teacher says he probably wants to do it to travel or play music and be in a band, right?
He responds:
"No, hopefully not breathing."
Uhh.... Obviously I look at my friend next to me, and everyone else in the class is making eyes at each other as well. This teacher, not having much mental hold of herself even on a normal day, says with an uncomfortable chuckle, "that's morbid," and then we move on to the next person.

Fortunately, it's been about five years and that kid is still breathing and not in the back of some van. He once came into the restaurant where I was a server with his family, and they all seemed very nice. However, unlike many of the other kids in that class, he did not pursue any higher education, and I believe he is now practicing Buddhism daily.
Sometimes I wonder whether, had he had a better experience in life overall that year, he might have at least gone to community college, or applied the next fall to some small schools that would have given him a scholarship. There is nothing wrong with practicing Buddhism, but he was intelligent and could definitely have been closer to a stable career path by now if he had had a plan for his future in high school.


We all either have been or will be 10th graders at one point in life. Maybe that was the year you finally joined cross country, the year you and your best friend stopped being friends, or the year you asked your crush to the dance and they said yes.
For many, sophomore year of high school hosts the creation of memories to last a lifetime. But while you are growing socially and gaining real-life experience, you are also at a crossroads for making decisions about colleges, majors, and careers. The memories garnered during 10th grade become the evidence that shapes your sense of self and perceived place in life, which in turn inform your decisions about where you see yourself in 5 and 20 years.

### 1.1 Background
There are many proposed factors influencing which career paths people end up on. Examples include the aptitudes, attitudes, and expectations of peers and parents, how well a student did in previous classes, how much recognition a student gets for good work, and their socioeconomic background.
The University of Michigan has conducted a <a href="https://www.lsay.org/about.html">longitudinal study</a> following students and their parents from 7th to 12th grade, where each year they answer a set of questions related to school and other questions about their lives and opinions.
They provide the full dataset from these surveys, including questions about the amount of homework they do for a class, their opinions on specific subjects, teacher gender, and summer activities.

### 1.2 Libraries Used
<ul>
<li>Pandas: Displaying and organizing dataframes</li>
<li>Numpy: Conducting operations on data</li>
<li>Matplotlib: Creating plots of data</li>
<li>Scikit-learn: Predictive Modeling</li>
</ul>

In [1]:
import pandas as pd
import numpy as np

## 2: Data Collection

### 2.1 Data Source
My data comes from the <a href="https://www.lsay.org/about.html">Longitudinal Study of American Youth (LSAY)</a> conducted by the University of Michigan. [1] The actual data was downloaded from <a href="https://www.icpsr.umich.edu/web/ICPSR/studies/30263?q=LSAY">ICPSR</a>, an archive of social science datasets.

### 2.2 About the Data
This dataset contains 11,904 columns. A few correspond to demographics or identifiers, and most contain the response to a survey question. Each row holds the answers for one student (case) and includes every question answered throughout the study; the rows are not grouped by year (2014-2017). A difference in year is denoted by the letters in the column code. A full description of the data is available in the dataset's <a href="../files/30263-0001-Codebook-ICPSR.pdf">codebook</a>.
There are a total of 5,945 cases in the dataset. Students were interviewed once in the fall and once in the spring every year between 7th and 12th grade, and their parents were interviewed once every year.

### 2.3 Using the Data in Python
First, the data must be loaded into Python. The data I'm using is a Stata file (denoted by the .dta extension), and pandas has a function that reads a Stata file into a dataframe. Pandas also has functions for reading in <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_spss.html">SPSS</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">CSV</a> files.

In [2]:
all_data = pd.read_stata('long_study_school.dta', convert_categoricals=False)
all_data.head()

Unnamed: 0,CASENUM,COHORT,SCHOOLID,STRATA,ASCICLS,ASCITCH,ASTSEX,AMTHCLS,AMTHTCH,AMTSEX,...,PEDUC3,MEDSRCE,FEDSRCE,MOTHOCC,FATHOCC,POCI,FOCCSRCE,MOCCSRCE,MOTHSEI,FATHSEI
0,1001,1,309,6,-95,-95,-95,-95,-95,-95,...,1,2,3,395,785,0,5,4,34,15
1,1002,2,132,11,132032,13203,1,132061,13206,2,...,4,7,7,-99,-99,-99,-99,-99,-99,-99
2,1003,1,309,6,-95,-95,-95,-95,-95,-95,...,1,3,2,988,535,0,4,5,-98,27
3,1004,2,126,8,126026,12602,2,126101,12610,2,...,1,8,8,270,331,0,11,11,55,53
4,1005,2,133,11,-99,-99,-99,133024,13302,1,...,4,3,7,65,471,0,2,3,87,32


As the output of head() suggests, this dataset is very wide: it has 11,904 columns, which is far more than will be useful for a single exploratory data analysis project. To get a better idea of which data I might look at, I skimmed the associated codebook, which gives a description for each column name.
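
If the .dta file embeds variable labels (an assumption; not every export does, and the codebook remains the authoritative reference), pandas can surface them, which makes it possible to search the column names by keyword instead of scrolling through the PDF. Here is a minimal sketch that uses a tiny stand-in file with invented labels rather than the real LSAY data; `find_columns` is a hypothetical helper.

```python
import os
import tempfile

import pandas as pd


def load_variable_labels(path):
    """Return a {column name: label} dict from a Stata file's metadata."""
    with pd.read_stata(path, iterator=True) as reader:
        return reader.variable_labels()


def find_columns(labels, keyword):
    """Hypothetical helper: column names whose label mentions the keyword."""
    keyword = keyword.lower()
    return [name for name, label in labels.items() if keyword in label.lower()]


# Stand-in file: the labels below are invented for illustration,
# not copied from the LSAY codebook.
demo = pd.DataFrame({"CASENUM": [1001, 1002], "GAMTH1J": [1, 5]})
demo_path = os.path.join(tempfile.mkdtemp(), "demo.dta")
demo.to_stata(
    demo_path,
    write_index=False,
    variable_labels={
        "CASENUM": "Case number",
        "GAMTH1J": "Hours of math homework per week",
    },
)

labels = load_variable_labels(demo_path)
homework_columns = find_columns(labels, "homework")
print(homework_columns)
```

On the real file, the same two functions could be pointed at 'long_study_school.dta', provided the labels are actually present there.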

## 3: Exploratory Data Analysis & Data Visualization

### 3.1 Total Hours Spent Doing Homework
I decided to first explore the hours of homework completed by 10th graders, since I feel like this could give good insight into attitudes about life and school.
I plan to sum the homework hours per week across all subjects, adding 0 for a response of -99 (which, according to the codebook, indicates that the student did not take the respective course during that term) and recording whether the response is blank (-98) or the student did not participate in that question (-95). From this, I can look to see if there are any common outcomes predicted by the number of hours of homework each week.


This code filters the table to only include columns which correspond to the appropriate codes for the number of hours of homework reported by 10th graders.
The first letter is either G or H, meaning that this includes the responses from the 10th graders in the fall (G) and in the spring (H).
The second letter must be A, because the 'A' group asks questions directly related to schoolwork.
Next, there is a three-character code for the subject (for example, MTH for Math and COM for Computer). The regular expression matches this as any three characters.
After this subject marker, there may or may not be a digit. A digit is present when a student can be enrolled in more than one class of a subject during that semester; in that case the first class has a 1 added and the second a 2. If the regular expression did not look for this digit, any subjects with more than one option would be excluded.
Finally, the column code needs to end in the letter J, as this is the code in section A that asks how many hours of homework a week the student has for that subject.
When I use regular expressions, I find it easier to test them online before running them in code. That way, if something is not working, I know whether the regular expression specifically is at fault. Some websites also show you how much was matched and whether any groups were matched; I like <a href="https://regex101.com/">regex101.com</a>.

The CASENUM column should also be included to keep track of which student the answers correspond to, in case the original dataframe needs to be referenced later.
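
The same kind of check can be done locally with Python's `re` module. This is a sketch, not the notebook's own cell: it writes the pattern with the two alternatives wrapped in one group so that `^` and `$` apply to both, and it tries the pattern on a handful of names. `FAMTH1J` and `GAMTH1K` are made up to probe the pattern; the other names appear in the dataset.

```python
import re

# Grouping the alternatives makes ^ and $ apply to both of them.
PATTERN = re.compile(r"^(?:[GH]A.{3}[12]?J|CASENUM)$")

# Real column names plus a few that should be rejected.
# FAMTH1J and GAMTH1K are invented for this test.
samples = ["CASENUM", "GAMTH1J", "GASSTJ", "HAVOCJ", "AMTHCLS", "FAMTH1J", "GAMTH1K"]

# DataFrame.filter(regex=...) keeps a column when re.search finds a match,
# so the same call is used here.
matched = [name for name in samples if PATTERN.search(name)]
rejected = [name for name in samples if name not in matched]

print("matched: ", matched)
print("rejected:", rejected)
```

Only the case number and the G/H homework codes should end up in `matched`; anything else in that list means the pattern needs another look before it is applied to all 11,904 columns.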

In [3]:
# The alternatives are wrapped in one group so that ^ and $ apply to both
# the homework codes and CASENUM.
tenth_hours = all_data.filter(regex=r"^(?:[GH]A.{3}[12]?J|CASENUM)$")
# Note: filter() returns a new dataframe and leaves all_data unchanged,
# so the result is stored as tenth_hours.

tenth_hours.head()

Unnamed: 0,CASENUM,GAMTH1J,GAMTH2J,GASCI1J,GASCI2J,GAENG1J,GAENG2J,GASSTJ,GACOMJ,GAFORJ,...,HASCI1J,HASCI2J,HAENG1J,HAENG2J,HASSTJ,HACOMJ,HAFORJ,HAARTJ,HAMUSJ,HAVOCJ
0,1001,1,-99,3,-99,1,-99,-99,-99,-99,...,2,-99,1,-99,2,-99,-99,-99,-99,0
1,1002,5,-99,3,-99,3,-99,-99,-99,-99,...,5,-99,1,-99,-99,-99,-99,0,-99,-99
2,1003,3,-99,4,-99,3,-99,-99,-99,4,...,4,-99,4,-99,-99,-99,3,-98,-99,-99
3,1004,-95,-95,-95,-95,-95,-95,-95,-95,-95,...,-95,-95,-95,-95,-95,-95,-95,-95,-95,-95
4,1005,1,-99,1,-99,1,-99,-99,-99,1,...,1,-99,0,-99,-99,-99,1,-99,1,-99


By using the head() function, we can confirm that we did retrieve the desired columns. Another option would be to print the dataframe's columns attribute and check that it includes all the codes I wanted to find.
We can also double-check that there is an appropriate number of columns: 25. Looking at the codebook, this is indeed the number of questions asked of 10th graders about hours per week spent on homework (24), plus one for the case number column.

Although this filtered dataset contains considerably less data than the original dataset, it is still not easily interpretable.
Because there are still so many columns in the filtered dataset, we can create a new dataframe which contains the case number (which allows cross-referencing between tables) and, for now, the total number of hours spent per week doing homework across all subjects, the number of classes a student was taking, and the number of questions a student skipped or did not participate in.
Including the information about why a question was not answered could give insight into the accuracy of the conclusions drawn from the data. It also allows a case to be excluded altogether if the student did not participate in any of the questions (a non-participation count of 24, as there are 24 codes observed).
This new dataframe could also be used later to store more statistics about a student. The number of classes could help determine an average time per class, or point to a student who has fewer hours because the given subjects do not describe their actual schedule well.

In [4]:
statistics = pd.DataFrame(columns=['CASENUM', 'HOURSTOTAL', 'SKIPS', 'NONPART', 'NUMENROLLED'])


# Get the column names for the full tenth grade dataset.
# CASENUM holds identifiers rather than answers to a question, so it is removed from this list.
columns = list(tenth_hours)
columns.remove('CASENUM')

for index, row in tenth_hours.iterrows():
    total_hours = 0
    skips = 0
    nonpart = 0
    num_enrolled = 0

    # iterate over each column name in the dataframe
    for code in columns:
        if row[code] == -98:
            # response is blank (no response)
            skips += 1
            
        elif row[code] == -95:
            # student did not participate in this question
            nonpart += 1

        elif row[code] != -99:
            num_enrolled += 1
            total_hours += row[code]
        
    statistics.loc[len(statistics.index)] = [row['CASENUM'], total_hours, skips, nonpart, num_enrolled]

statistics.head()

Unnamed: 0,CASENUM,HOURSTOTAL,SKIPS,NONPART,NUMENROLLED
0,1001,10,0,0,8
1,1002,19,0,0,8
2,1003,28,1,0,9
3,1004,0,0,24,0
4,1005,8,0,0,10
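
The loop above is easy to follow, but it visits every cell in Python, which can get slow as the number of rows grows. One alternative is to build a boolean mask for each special code and sum the masks row-wise. It is sketched here on a small made-up table rather than the real data; `summarize_hours` is a hypothetical name, and whether it is meaningfully faster on this dataset is untested.

```python
import pandas as pd

# Made-up stand-in for tenth_hours: three students, three homework columns.
toy_hours = pd.DataFrame({
    "CASENUM": [1, 2, 3],
    "GAMTH1J": [1, -95, 3],
    "GAENG1J": [-99, -95, -98],
    "HAMTH1J": [2, -95, 4],
})


def summarize_hours(hours):
    """Compute the same per-student counts as the loop, without iterrows."""
    answers = hours.drop(columns="CASENUM")
    skipped = answers.eq(-98)   # blank responses
    nonpart = answers.eq(-95)   # did not participate in the question
    # Anything that is not a special code counts as an enrolled class.
    enrolled = answers.ne(-99) & ~skipped & ~nonpart
    return pd.DataFrame({
        "CASENUM": hours["CASENUM"],
        # Special codes are replaced by 0 before summing the hours.
        "HOURSTOTAL": answers.where(enrolled, 0).sum(axis=1),
        "SKIPS": skipped.sum(axis=1),
        "NONPART": nonpart.sum(axis=1),
        "NUMENROLLED": enrolled.sum(axis=1),
    })


toy_stats = summarize_hours(toy_hours)
print(toy_stats)
```

The masks mirror the three branches of the loop, so the two versions should agree row for row; comparing them on the first few cases is a cheap sanity check.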


Next, the cases that did not participate in any questions should be removed since they do not record any of the data we are looking for. This happens when the non-participation value is 24.

In [5]:
# Select the subset of the dataframe that meets the given condition:
# NONPART is not 24, which removes any rows with a non-participation count of 24.
statistics = statistics[statistics.NONPART != 24]

statistics.head()

Unnamed: 0,CASENUM,HOURSTOTAL,SKIPS,NONPART,NUMENROLLED
0,1001,10,0,0,8
1,1002,19,0,0,8
2,1003,28,1,0,9
4,1005,8,0,0,10
6,1007,30,0,0,10
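
As mentioned earlier, the number of classes could be used to estimate an average time per class. This is a rough sketch of that idea, rebuilt from the first five rows shown above so it can run on its own; `AVGHOURS` is a hypothetical column name. Because each homework column is one subject in one term, the result is an average per class per term, and students with no enrolled classes get a missing value instead of a division by zero.

```python
import pandas as pd

# The first five rows of the statistics table, copied from the output above.
stats = pd.DataFrame(
    [[1001, 10, 0, 0, 8],
     [1002, 19, 0, 0, 8],
     [1003, 28, 1, 0, 9],
     [1004, 0, 0, 24, 0],
     [1005, 8, 0, 0, 10]],
    columns=["CASENUM", "HOURSTOTAL", "SKIPS", "NONPART", "NUMENROLLED"],
)

# 24 homework questions were observed; in the notebook this would be len(columns).
n_questions = 24

# Drop students who did not participate in any of the questions.
kept = stats[stats["NONPART"] < n_questions].copy()

# Average weekly hours per enrolled class; 0 enrolled classes becomes NaN.
enrolled = kept["NUMENROLLED"].where(kept["NUMENROLLED"] > 0)
kept["AVGHOURS"] = kept["HOURSTOTAL"] / enrolled

print(kept)
```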


## Citations

[1] Miller, J. D. (2021). *Longitudinal Study of American Youth, 1987-1994, 2007-2011, 2014-2017* [Data set]. doi:10.3886/ICPSR30263.v7