# Problem 7: Reading Test Scores

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. 
This test provides a quantitative way to compare the performance of students from different parts of the world. 

The dataset *pisa2009.csv* contains information about the demographics and schools of American students who took the exam. It is derived from the 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES).

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Theory/master/Data/pisa2009.csv'
data = pd.read_csv(url)
data

Unnamed: 0,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,11,1,,,0.0,,,1.0,,,...,0.0,1.0,0.0,225.0,,1.0,1,1,673.0,476.00
1,11,1,White,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
2,9,1,White,1.0,1.0,1.0,1.0,1.0,1.0,,...,1.0,1.0,0.0,250.0,28.0,1.0,1,0,1233.0,554.81
3,10,0,Black,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
4,10,1,Hispanic,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3658,9,1,White,0.0,1.0,1.0,,0.0,1.0,1.0,...,1.0,1.0,0.0,250.0,20.0,1.0,1,0,421.0,509.99
3659,9,1,White,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,450.0,16.0,1.0,1,0,1317.0,444.90
3660,10,1,Hispanic,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,225.0,16.0,1.0,1,1,539.0,476.89
3661,11,1,Black,0.0,0.0,1.0,0.0,,,0.0,...,1.0,1.0,0.0,54.0,36.0,1.0,1,1,,363.61


**Data Description**

| Feature | Description |
| :- | :- |
| grade | The grade in school of the student (most 15-year-olds in America are in 10th grade) |
| male | Whether the student is male (1) or female (0) |
| raceeth | The race/ethnicity composite of the student |
| preschool | Whether the student attended preschool (yes - 1, no - 0) |
| expectBachelors | Whether the student expects to obtain a bachelor's degree (yes - 1, no - 0) |
| motherHS | Whether the student's mother completed high school (yes - 1, no - 0) |
| motherBachelors | Whether the student's mother obtained a bachelor's degree (yes - 1, no - 0) |
| motherWork | Whether the student's mother has part-time or full-time work (yes - 1, no - 0) | 
| fatherHS | Whether the student's father completed high school (yes - 1, no - 0) |
| fatherBachelors | Whether the student's father obtained a bachelor's degree (yes - 1, no - 0) |
| fatherWork | Whether the student's father has part-time or full-time work (yes - 1, no - 0) | 
| selfBornUS | Whether the student was born in the United States of America (yes - 1, no - 0) |
| motherBornUS | Whether the student's mother was born in the United States of America (yes - 1, no - 0) |
| fatherBornUS | Whether the student's father was born in the United States of America (yes - 1, no - 0) |
| englishAtHome | Whether the student speaks English at home (yes - 1, no - 0) |
| computerForSchoolwork | Whether the student has access to a computer for schoolwork (yes - 1, no - 0) |
| read30MinsADay | Whether the student reads for pleasure for 30 minutes/day (yes - 1, no - 0) |
| minutesPerWeekEnglish | The number of minutes per week the student spends in English class |
| studentsInEnglish | The number of students in this student's English class at school |
| schoolHasLibrary | Whether this student's school has a library (yes - 1, no - 0) |
| publicSchool | Whether this student attends a public school (yes - 1, no - 0) |
| urban | Whether this student's school is in an urban area (yes - 1, no - 0) |
| schoolSize | The number of students in this student's school |
| readingScore | The student's reading score, on a 1000-point scale |

Some columns contain missing data. The number of missing values per column is:

In [4]:
# missing data
data.isnull().sum()

grade                      0
male                       0
raceeth                   35
preschool                 56
expectBachelors           62
motherHS                  97
motherBachelors          397
motherWork                93
fatherHS                 245
fatherBachelors          569
fatherWork               233
selfBornUS                69
motherBornUS              71
fatherBornUS             113
englishAtHome             71
computerForSchoolwork     65
read30MinsADay            34
minutesPerWeekEnglish    186
studentsInEnglish        249
schoolHasLibrary         143
publicSchool               0
urban                      0
schoolSize               162
readingScore               0
dtype: int64
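Before deciding how to handle the gaps, it can help to see how much data is affected. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from *pisa2009.csv*) to show one way of computing the fraction of missing values per column and the number of fully observed rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few PISA columns
toy = pd.DataFrame({
    "grade": [10, 11, 9, 10],
    "preschool": [1.0, np.nan, 0.0, 1.0],
    "schoolSize": [673.0, np.nan, 1233.0, np.nan],
})

# Fraction of missing values in each column
missing_fraction = toy.isnull().mean()

# Number of rows with no missing value in any column
complete_rows = int(toy.notnull().all(axis=1).sum())

print(missing_fraction)
print(complete_rows, "of", len(toy), "rows are complete")
```

Applying the same two expressions to `data` would show how many of its rows survive a complete-case filter.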

We will therefore remove every row that has at least one missing value.

In [5]:
# remove rows with missing data
data.dropna(axis=0, how='any', inplace=True)

In [6]:
# check that there is no missing data
data.isnull().sum()

grade                    0
male                     0
raceeth                  0
preschool                0
expectBachelors          0
motherHS                 0
motherBachelors          0
motherWork               0
fatherHS                 0
fatherBachelors          0
fatherWork               0
selfBornUS               0
motherBornUS             0
fatherBornUS             0
englishAtHome            0
computerForSchoolwork    0
read30MinsADay           0
minutesPerWeekEnglish    0
studentsInEnglish        0
schoolHasLibrary         0
publicSchool             0
urban                    0
schoolSize               0
readingScore             0
dtype: int64

The **goal** is to fit a linear regression model that predicts the reading score (variable *readingScore*) from the remaining variables.
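As a rough illustration of what such a fit involves, the sketch below works on a tiny made-up frame (`toy` and all of its values are hypothetical). It assumes the categorical column *raceeth* is one-hot encoded with `pd.get_dummies`, and it solves the least-squares problem with `np.linalg.lstsq`. This is only one possible approach, not necessarily the one intended for this problem.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({
    "grade": [9, 10, 10, 11, 10, 9, 11, 10],
    "raceeth": ["White", "Black", "Hispanic", "White",
                "Black", "Hispanic", "White", "Black"],
    "read30MinsADay": [0, 1, 1, 0, 1, 0, 1, 0],
    "readingScore": [480.0, 510.0, 530.0, 520.0, 495.0, 470.0, 560.0, 500.0],
})

# Separate the target from the predictors
y = toy["readingScore"].to_numpy()
X = toy.drop(columns="readingScore")

# One-hot encode the categorical column, dropping one level to avoid
# collinearity with the intercept
X = pd.get_dummies(X, columns=["raceeth"], drop_first=True).astype(float)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predictions and R^2 on the toy data
y_hat = A @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(dict(zip(["intercept", *X.columns], coef.round(2))))
print("R^2 on the toy data:", round(r2, 3))
```

The same steps would apply to the cleaned `data` frame, where *raceeth* is the only non-numeric column shown in the preview above.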