In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 8

## Warm up: Defining and Applying Functions

In [None]:
# Load Galton's dataset of parent and child heights, dropping the column at index 3
galton = Table.read_table('data/galton.csv').drop(3)
galton

**Question:** define a function called `average_height` that takes two arguments and returns their average.

In [None]:
# ...

**Question:** use the `apply` method to create an array with the average height of `mother` and `father` for each row in the `galton` table.

In [None]:
# ...

**Question:** add a new column of average parent heights to the table, with the column name `midparentHeight`.

In [None]:
# ...

## Prediction

Can we use the average height of a child's parents to predict the child's height? Scatter plots are a good way to look for relationships between two variables:

In [None]:
galton.scatter('midparentHeight', 'childHeight')

Suppose that a child's parents have an average height of 68 inches. How might we predict that child's height? A reasonable approach would be to identify other children whose parents' average height is close to 68 inches, and use the average height of these children as our prediction.
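Before working with the table, the idea can be sketched in plain Python. This is only an illustration: the values in `toy_families` are made up (they are not rows from Galton's data), `predict_from_nearby` is a hypothetical helper name, and the sketch treats "close" as within 0.5 inches with the endpoints included.

```python
# Toy (made-up) data: each pair is (parents' average height, child's height), in inches
toy_families = [
    (66.0, 65.0),
    (67.6, 66.0),
    (68.0, 68.0),
    (68.4, 70.0),
    (70.0, 71.0),
]

def predict_from_nearby(pairs, h, width=0.5):
    """Average the child heights of families whose parents' average is within width of h."""
    nearby = [child for parent_avg, child in pairs if abs(parent_avg - h) <= width]
    if not nearby:
        return None  # no comparable families, so no prediction
    return sum(nearby) / len(nearby)

toy_prediction = predict_from_nearby(toy_families, 68)
print(toy_prediction)
```

Returning `None` when no families are nearby is one possible design choice; it makes the "no data" case explicit instead of dividing by zero.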

Let's identify the rows of the table where `midparentHeight` is within 0.5 inches of 68:

In [None]:
galton.scatter('midparentHeight', 'childHeight')

# Add vertical red lines to the scatter plot, to indicate the range [67.5, 68.5]
plots.plot([67.5, 67.5], [55, 80], color='red', lw=2)
plots.plot([68.5, 68.5], [55, 80], color='red', lw=2);

In [None]:
# Use the where method to identify rows where parents' average height is within 0.5 inches of 68
nearby = galton.where('midparentHeight', are.between(67.5, 68.5))
nearby

In [None]:
# Calculate the average height of children in these rows
nearby_mean = nearby.column('childHeight').mean()
nearby_mean

In [None]:
galton.scatter('midparentHeight', 'childHeight')
plots.plot([67.5, 67.5], [55, 80], color='red', lw=2)
plots.plot([68.5, 68.5], [55, 80], color='red', lw=2)

# Mark the predicted height for children whose parents' average height is 68 inches
plots.scatter(68, nearby_mean, color='red', s=70);

Now we have a sequence of steps for predicting a child's height: identify the rows of the table where `midparentHeight` is within 0.5 inches of the average height of the child's parents, then calculate the average of `childHeight` for those rows.

It would be a good idea to write a function to perform these steps, so that we can reuse it:

In [None]:
def predict(h):
    """
    Predict the height of a child based on the average height h of their parents.
    """
    nearby = galton.where('midparentHeight', are.between(h - 1/2, h + 1/2))
    return np.average(nearby.column('childHeight'))

In [None]:
predict(68)

In [None]:
predict(70)

In [None]:
predict(73)

**Question:** use the `apply` method to predict the height of every child (row) in the table `galton`. Add this array of predictions back into the table as a new column `predictedHeight`.

In [None]:
# ...

Let's see how our predictions look on the scatter plot. Instead of providing a single column name as the second argument to `scatter`, we can provide a list of column names, and `scatter` will plot both of those columns:

In [None]:
galton.scatter('midparentHeight', ['childHeight', 'predictedHeight'])

How accurate is our approach to predicting heights?

**Question:** write a function called `difference` that takes two arguments, and returns the difference between them.

In [None]:
# ...

**Question:** use the `apply` method to create an array of prediction errors (`childHeight` minus `predictedHeight`) for each row. Add this array to the `galton` table as a new column `errors`.

In [None]:
# ...

Let's look at the distribution of errors with a simple histogram:

In [None]:
galton.hist('errors')

If we'd like, we can plot separate histograms for different groups of rows, based on the value of a particular variable. For example, we can create separate histograms for the errors of boys and girls by adding the argument `group='gender'`, which groups rows by the `gender` column.

In [None]:
galton.hist('errors', group='gender')

What do you notice about these two histograms?

In [None]:
# ...

## Discussion Activity

**Question 1:** how can we also account for gender when making predictions about a child's height? Don't write any code yet; just explain what you would do to modify our prediction approach in a way that incorporates gender as well.

In [None]:
# ...

**Question 2:** Create a new function `predict_smarter`, which takes two arguments: `h` for the average height of the child's parents, and `g` for the child's gender. Define this function so that it returns a prediction for the child's height using both `h` and `g`.

In [None]:
# ...

In [None]:
predict_smarter(68, 'female')

In [None]:
predict_smarter(68, 'male')

In [None]:
# Apply the predict_smarter function to get a new prediction for all children in the table
smarter_predicted_heights = galton.apply(predict_smarter, 'midparentHeight', 'gender')

# Add these predictions as a new column
galton = galton.with_column('smartPredictedHeight', smarter_predicted_heights)
galton

In [None]:
# Apply the difference function once again to calculate errors for smartPredictedHeight
smarter_pred_errs = galton.apply(difference, 'childHeight', 'smartPredictedHeight')
galton = galton.with_column('smartErrors', smarter_pred_errs)
galton

In [None]:
# Plot the distributions of errors for male and female children
galton.hist('smartErrors', group='gender')

## Grouping by One Column

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. The `Table` method `group` makes this easy.
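To make the idea concrete, here is a minimal plain-Python sketch of grouping followed by summarizing. It does not use the table library: `toy_rows` holds made-up values (not rows from any dataset in this lecture), and `group_values` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy (made-up) rows: (studio, gross in millions of dollars)
toy_rows = [
    ("Studio A", 100),
    ("Studio B", 250),
    ("Studio A", 300),
    ("Studio B", 50),
    ("Studio C", 400),
]

def group_values(rows):
    """Collect the values that belong to each distinct label."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return dict(groups)

groups = group_values(toy_rows)

# Two characteristics of each group: its size and the average of its values
counts = {label: len(values) for label, values in groups.items()}
averages = {label: sum(values) / len(values) for label, values in groups.items()}

print(counts)
print(averages)
```

The two steps are separate on purpose: once the rows are grouped, any summary (count, average, minimum, maximum) can be computed from the same groups.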

We've seen the `group` method before in this class, when creating bar charts to visualize distributions of categorical variables:

In [None]:
# Load the table of highest-grossing movies
top_movies = Table.read_table('data/top_movies_2017.csv')

# Add a column of ages to the table
ages = 2023 - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies 

In [None]:
# Use the group method to count how many rows in the table have each value of Studio
studio_distribution = top_movies.group('Studio')
studio_distribution

When we call `group` with one argument (the name of a column in the table, e.g., "Studio"), `group` counts how many rows have each distinct value in that column. What if we want something other than counts? We can supply a function as a second argument to `group`, and it will be applied to the values of every other column within each group:

In [None]:
# Use group to calculate the average of the other columns for each studio
top_movies.group('Studio', np.average)

In [None]:
# Use group to calculate the min values of the other columns for each studio
top_movies.group('Studio', min)

In [None]:
# Use group to calculate the max values of the other columns for each age
top_movies.group('Age', max)

## Data Cleaning: Class Data Survey

Thanks to everyone who filled out the class data survey! Let's take a look at the data.

In [None]:
# Load the class data survey
survey = Table.read_table('data/cmpsc5a-classdata-w23.csv')
survey

In [None]:
# Use the group method to count how many respondents are in each major
survey.group('Major').show()

**Question:** do you see any problems with these values?

In [None]:
# ...

In [None]:
# Get an array of the "unique" values of Major
# Why do some majors appear multiple times?
survey.group('Major').column('Major')

In [None]:
len(survey.group('Major').column('Major'))

*Data cleaning* is an important part of data science. Raw datasets will often have inconsistent values, or even missing values, which we will have to deal with before doing any analysis.
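As a small illustration of why inconsistent values matter, the plain-Python sketch below counts distinct values before and after a simple normalization. The responses in `raw_majors` are made up (they are not taken from the class survey), and `normalize` is a hypothetical helper that only lower-cases text and removes spaces and hyphens.

```python
# Toy (made-up) survey responses with inconsistent spelling and spacing
raw_majors = ["Statistics", "statistics", "Pre-Biology", "pre biology", "Econ", "ECON "]

def normalize(value):
    """Lower-case a response and remove spaces and hyphens."""
    return value.lower().replace(" ", "").replace("-", "")

cleaned_majors = [normalize(m) for m in raw_majors]

# Count the distinct values before and after cleaning
distinct_before = len(set(raw_majors))
distinct_after = len(set(cleaned_majors))
print(distinct_before, distinct_after)
```

Every raw response above is a different string, so grouping on the raw values would treat each one as its own category; after normalization, responses that mean the same thing share one value.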

In [None]:
def clean_major(major):
    """
    Given a raw value of the Major variable, return a new string with the following changes:
      - all characters are lower case
      - all spaces and hyphens are removed
      - common words are replaced with their abbreviations
    """

    # force all characters to be lower-case
    major = major.lower()

    # remove spaces and hyphens
    major = major.replace(' ', '')
    major = major.replace('-', '')

    # replace words with common abbreviations
    major = major.replace('biology', 'bio')
    major = major.replace('economics', 'econ')
    major = major.replace('psychology', 'psych')
    major = major.replace('psychological', 'psych')
    major = major.replace('chemistry', 'chem')
    
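    # some miscellaneous changes
    # normalize to the plural form; collapsing to the singular first
    # avoids turning 'communications' into 'communicationss'
    major = major.replace('communications', 'communication')
    major = major.replace('communication', 'communications')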
    
    return major    

In [None]:
# Add a column "cleaned_major", where we have applied the clean_major function
survey = survey.with_column(
    'cleaned_major',
    survey.apply(clean_major, 'Major'))

In [None]:
survey.group('cleaned_major').show()

In [None]:
len(survey.group('cleaned_major').column('cleaned_major'))

This looks much better! We've gone from 45 unique values of the Major variable down to 28.

In [None]:
# Look at average values for each column by major
survey.group('cleaned_major', np.average)

**Question:** the value `nan` indicates that the average could not be computed. Why is this happening?

In [None]:
# ...