
# Investigate, wrangle, add new data to the dataframe, filter and correlate

The following data file contains student scores in maths, reading and writing. The exam scores are assumed to be percentages. The file also records gender, ethnicity, parental education, whether the student qualifies for free school meals, and whether the student has taken a preparation course for the exams. The data set has already been cleaned.

1.  Investigate the data set.

2.  Create numpy arrays to hold each of the three sets of scores.  
    Create a new numpy array to hold the average exam score (the mean of the three scores for each student).  
    Add the new numpy array as a new column in the dataframe.

3.  Filter the original dataset into a new dataframe containing just the females.  Calculate the average exam scores for all rows in this new dataframe and then find the mean of the average column.

4.  Do the same for the males.

5.  Use the original dataset to find the correlation coefficient for reading and maths.  How closely do they correlate?  Write what you find in a text box below the code.

6.  Do the same for reading and writing. What do you find?

7.  You might want to filter on different criteria and check the correlation again (e.g. students on free school meals, or students who took the preparation course).
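
As a rough illustration of task 2, the sketch below builds one numpy array per exam, averages them element-wise and attaches the result as a new column. The three-student `sample` dataframe is made up for this example and is not taken from exams.csv; only the column names follow the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical three-student sample (not rows from exams.csv)
sample = pd.DataFrame({
    "math_score": [60, 75, 90],
    "reading_score": [70, 72, 81],
    "writing_score": [65, 69, 84],
})

# One numpy array per exam
math = sample["math_score"].to_numpy()
reading = sample["reading_score"].to_numpy()
writing = sample["writing_score"].to_numpy()

# Element-wise mean of the three scores for each student
average = (math + reading + writing) / 3

# Attach the array as a new column
sample["average_score"] = average
print(sample)
```

Because the arrays have the same length and order as the dataframe rows, assigning the array to a new column lines each average up with the right student.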

The dataset is a .csv file and can be accessed here: https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/exams.csv

**NOTE:** Some useful references are included at the bottom of this notebook.

Use the code cell below to write your code.

In [27]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/exams.csv"
df = pd.read_csv(url)

# One numpy array per exam (scores are percentages, so int16 is ample)
math_score = df["math_score"].to_numpy(np.int16)
reading_score = df["reading_score"].to_numpy(np.int16)
writing_score = df["writing_score"].to_numpy(np.int16)

# Average of the three scores for each student, added as a new column
average_score = (math_score + reading_score + writing_score) / 3
df["average_score"] = average_score
display(df.head())
print()


def create_newdf(df, col, condition):
    """Return the rows where df[col] equals condition, and the rounded mean of their average_score."""
    new_df = df[df[col] == condition]
    col_mean = round(new_df["average_score"].mean())
    return new_df, col_mean


female_df, female_mean = create_newdf(df, "gender", "female")
male_df, male_mean = create_newdf(df, "gender", "male")

display(female_df.head())
print("Girls average score: ", female_mean)
print()

display(male_df.head())
print("Boys average score: ", male_mean)
print()


Unnamed: 0,gender,ethnicity,parental_education,lunch,preparation_course,math_score,reading_score,writing_score,average_score
0,female,group E,some college,free/reduced,none,65,76,71,70.666667
1,male,group C,some college,standard,completed,75,72,69,72.0
2,female,group B,some high school,free/reduced,completed,62,56,61,59.666667
3,male,group D,some high school,standard,completed,60,60,59,59.666667
4,female,group C,high school,free/reduced,completed,34,54,55,47.666667





Unnamed: 0,gender,ethnicity,parental_education,lunch,preparation_course,math_score,reading_score,writing_score,average_score
0,female,group E,some college,free/reduced,none,65,76,71,70.666667
2,female,group B,some high school,free/reduced,completed,62,56,61,59.666667
4,female,group C,high school,free/reduced,completed,34,54,55,47.666667
5,female,group B,some college,free/reduced,none,53,66,61,60.0
6,female,group D,some college,standard,none,85,88,92,88.333333


Girls average score:  71



Unnamed: 0,gender,ethnicity,parental_education,lunch,preparation_course,math_score,reading_score,writing_score,average_score
1,male,group C,some college,standard,completed,75,72,69,72.0
3,male,group D,some high school,standard,completed,60,60,59,59.666667
7,male,group E,some high school,standard,none,93,73,70,78.666667
9,male,group C,associate's degree,standard,completed,85,75,82,80.666667
13,male,group C,some college,standard,none,90,81,81,84.0


Boys average score:  66



In [20]:
# np.corrcoef returns a 2x2 matrix; the off-diagonal entries hold the coefficient
print("Correlation Coefficient of Maths and Reading:")
display(np.corrcoef(x=reading_score, y=math_score))
print()
print("Correlation Coefficient of Reading and Writing:")
display(np.corrcoef(x=reading_score, y=writing_score))

Correlation Coefficient of Maths and Reading:


array([[1.        , 0.81459752],
       [0.81459752, 1.        ]])


Correlation Coefficient of Reading and Writing:


array([[1.        , 0.95676863],
       [0.95676863, 1.        ]])

Both pairs of scores are strongly positively correlated: maths and reading with a coefficient of about 0.81, and reading and writing even more strongly, with a coefficient of about 0.96.
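
The coefficients quoted here are the off-diagonal entries of the matrix that `np.corrcoef` returns. A minimal sketch of pulling out that single number, using made-up scores for five students rather than the real data:

```python
import numpy as np

# Hypothetical scores for five students, for illustration only
reading = np.array([50, 60, 70, 80, 90])
writing = np.array([52, 61, 69, 83, 88])

matrix = np.corrcoef(reading, writing)  # 2x2 symmetric matrix
r = matrix[0, 1]                        # correlation between the two arrays
print(round(r, 2))
```

The diagonal entries are always 1 (each array correlates perfectly with itself), so only `matrix[0, 1]`, or equivalently `matrix[1, 0]`, carries information.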


### Helpful references
---

Filtering on one criterion using a dataframe reference:  
`filtered_df = df[df['column name'] == value]`  

Filtering on multiple criteria using dataframe references (each condition needs its own parentheses):  
`filtered_df = df[(df['first column name'] == value) & (df['second column name'] >= value)]`
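
A small runnable sketch of both filtering patterns follows. The four-row `sample` dataframe is hypothetical; only the column names mirror the exam data set.

```python
import pandas as pd

# Hypothetical sample, not rows from exams.csv
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "free/reduced", "standard"],
    "math_score": [85, 60, 62, 93],
})

# Single criterion
females = sample[sample["gender"] == "female"]

# Multiple criteria: wrap each condition in parentheses,
# because & binds more tightly than == and >=
female_high = sample[(sample["gender"] == "female") & (sample["math_score"] >= 80)]
print(female_high)
```

Use `&` for "and" and `|` for "or" when combining conditions; the Python keywords `and` and `or` do not work element-wise on pandas Series.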

Numpy help sheet:  http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54



In [28]:
free_lunch_df, free_lunch_mean = create_newdf(df, 'lunch', 'free/reduced')
standard_lunch_df, standard_lunch_mean = create_newdf(df, 'lunch', 'standard')

display(free_lunch_df.head())
print("Free school meal average score:", free_lunch_mean)
print('')
display(standard_lunch_df.head())
print("Standard school meal average score:",standard_lunch_mean)



Unnamed: 0,gender,ethnicity,parental_education,lunch,preparation_course,math_score,reading_score,writing_score,average_score
0,female,group E,some college,free/reduced,none,65,76,71,70.666667
2,female,group B,some high school,free/reduced,completed,62,56,61,59.666667
4,female,group C,high school,free/reduced,completed,34,54,55,47.666667
5,female,group B,some college,free/reduced,none,53,66,61,60.0
11,female,group D,associate's degree,free/reduced,none,54,69,70,64.333333


Free school meal average score: 62



Unnamed: 0,gender,ethnicity,parental_education,lunch,preparation_course,math_score,reading_score,writing_score,average_score
1,male,group C,some college,standard,completed,75,72,69,72.0
3,male,group D,some high school,standard,completed,60,60,59,59.666667
6,female,group D,some college,standard,none,85,88,92,88.333333
7,male,group E,some high school,standard,none,93,73,70,78.666667
8,female,group C,associate's degree,standard,completed,58,71,63,64.0


Standard school meal average score: 72
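
Filtering once per category works well for two groups. As an alternative sketch (not the approach used in the cells above), `groupby` can compute the mean for every category in one step. The `sample` dataframe below is hypothetical.

```python
import pandas as pd

# Hypothetical sample, not rows from exams.csv
sample = pd.DataFrame({
    "lunch": ["free/reduced", "standard", "free/reduced", "standard"],
    "average_score": [60.0, 72.0, 64.0, 80.0],
})

# Mean of average_score for each lunch category
group_means = sample.groupby("lunch")["average_score"].mean()
print(group_means)
```

The result is a Series indexed by category, so `group_means["standard"]` looks up a single group's mean.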


In [31]:
# Score arrays restricted to students on free/reduced lunch
math_score_fl = free_lunch_df["math_score"].to_numpy(np.int16)
reading_score_fl = free_lunch_df["reading_score"].to_numpy(np.int16)
writing_score_fl = free_lunch_df["writing_score"].to_numpy(np.int16)

print("Correlation Coefficient of Maths and Reading:")
display(np.corrcoef(x=reading_score_fl, y=math_score_fl))
print()
print("Correlation Coefficient of Reading and Writing:")
display(np.corrcoef(x=reading_score_fl, y=writing_score_fl))


Correlation Coefficient of Maths and Reading:


array([[1.        , 0.79602045],
       [0.79602045, 1.        ]])


Correlation Coefficient of Reading and Writing:


array([[1.        , 0.95514705],
       [0.95514705, 1.        ]])
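
For students on free/reduced lunch the coefficients (about 0.80 and 0.96) are close to those for the whole data set (about 0.81 and 0.96). To repeat the comparison for other criteria, such as the preparation course, one possible sketch loops over the groups and uses the pandas `Series.corr` method, which computes the Pearson coefficient by default. The `sample` dataframe and the `results` dictionary are made up for this example.

```python
import pandas as pd

# Hypothetical sample, not rows from exams.csv
sample = pd.DataFrame({
    "preparation_course": ["none", "none", "none",
                           "completed", "completed", "completed"],
    "reading_score": [55, 66, 80, 60, 75, 90],
    "writing_score": [50, 64, 79, 63, 74, 92],
})

# Reading/writing correlation within each preparation_course group
results = {}
for group, rows in sample.groupby("preparation_course"):
    results[group] = rows["reading_score"].corr(rows["writing_score"])

for group, r in results.items():
    print(group, round(r, 3))
```

With only three rows per group the values here mean little; the point is the pattern, which carries over unchanged to the full dataframe.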