# Class 2: Intro to Simple Models

## 1) Simplest Model
The simplest model uses the average of past values as the prediction for the next one.  
In math, this average is called the "mean".

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# our data for the grades a student got in their last 10 classes
grades = [85, 90, 75, 80, 65, 70, 85, 95, 60, 85]
grades = np.array(grades)

In [None]:
# calculate the average grade
np.mean(grades)
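
As a quick check, the mean is just the sum of the values divided by how many there are. The sketch below repeats the grades from above and compares the by-hand result with `np.mean`; the variable names `by_hand` and `with_numpy` are only for illustration.

```python
import numpy as np

# same ten grades as above
grades = np.array([85, 90, 75, 80, 65, 70, 85, 95, 60, 85])

# mean = sum of the values / number of values
by_hand = grades.sum() / len(grades)
with_numpy = np.mean(grades)

print(by_hand)     # 79.0
print(with_numpy)  # 79.0
```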

Now we can predict the grade a student might get in their next class: our guess is simply the mean of their past grades. It's a very simple model.  
How could we make it better?

In [None]:
# the type of each class (1 or 2), in the same order as the grades
class_type = [2, 2, 1, 1, 1, 1, 2, 2, 1, 2]
class_type = np.array(class_type)

In [None]:
# now let's filter on class type 2
grades_for_class_2 = grades[class_type == 2]

In [None]:
# now calculate the average
np.mean(grades_for_class_2)
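
The same filter works for every class type, so we can get one prediction per type. This is a minimal sketch that repeats the arrays from above; the dictionary name `mean_by_type` is an assumption made for this example.

```python
import numpy as np

grades = np.array([85, 90, 75, 80, 65, 70, 85, 95, 60, 85])
class_type = np.array([2, 2, 1, 1, 1, 1, 2, 2, 1, 2])

# one mean per class type: keep only the grades whose type matches
mean_by_type = {}
for t in np.unique(class_type):
    mean_by_type[int(t)] = float(np.mean(grades[class_type == t]))

print(mean_by_type)  # {1: 70.0, 2: 88.0}
```

In this data, the prediction for a type 2 class (88) is quite different from the overall mean (79), which is why splitting by class type can improve the model.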

## 2) Linear Model with Our Data

In [None]:
# read in our dataset
filename = 'https://raw.githubusercontent.com/sayhellojoel/grade78pythonmath/main/Data/kids%20anonymous%20data.csv'
df = pd.read_csv(filename)

In [None]:
# show the first 5 rows of our dataset
df.head(5)

### Scatterplot
It's always a good idea to look at your data before trying anything else.  
Here, we can visually see the trend in the data.  

In [None]:
# let's visualize our data to start with
plt.scatter(df['Age Decimal'], df['HEIGHT'])
plt.xlabel('Age')
plt.ylabel('Height (cm)')
plt.show()

In [None]:
# fit a straight line (degree 1) to the data: height = m * age + b
m, b = np.polyfit(df['Age Decimal'], df['HEIGHT'], 1)
print('The slope is', m)
print('The intercept is', b)
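
Once we have a slope and an intercept, we can predict a height for any age with `height = m * age + b`. The sketch below uses a tiny made-up set of ages and heights (not the class dataset) so the numbers are easy to follow; `toy_ages`, `toy_heights`, and `predict_height` are hypothetical names.

```python
import numpy as np

# made-up example data, NOT the class dataset
toy_ages = np.array([10.0, 11.0, 12.0, 13.0])
toy_heights = np.array([140.0, 146.0, 152.0, 158.0])

# fit a straight line (degree 1); polyfit returns slope first, then intercept
m, b = np.polyfit(toy_ages, toy_heights, 1)

def predict_height(age):
    """Predict height from age using the fitted line."""
    return m * age + b

# about 155 for this toy data (slope is about 6 cm per year, intercept about 80)
print(predict_height(12.5))
```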

In [None]:
# a straight line only needs two points: the youngest and oldest ages
min_value = df['Age Decimal'].min()
max_value = df['Age Decimal'].max()
x = np.array([min_value, max_value])
x

In [None]:
# heights the fitted line predicts at those two ages
y = m*x + b
y

In [None]:
# plot the data again, this time with the fitted line in red
plt.scatter(df['Age Decimal'], df['HEIGHT'])
plt.plot(x, y, color='red')
plt.xlabel('Age')
plt.ylabel('Height (cm)')
plt.show()

## 3) Checking Other Variables
Now let's check if any other variables look like they could predict height.

In [None]:
df.columns

In [None]:
# plot height against each variable, side by side
plt.figure(figsize=(15, 3))

plt.subplot(1, 4, 1)
plt.scatter(df['Age Decimal'], df['HEIGHT'])
plt.xlabel('Age')
plt.ylabel('Height (cm)')

plt.subplot(1, 4, 2)
plt.scatter(df['FOOT LENGTH'], df['HEIGHT'])
plt.xlabel('Foot Length')

plt.subplot(1, 4, 3)
plt.scatter(df['INDEX FINGER'], df['HEIGHT'])
plt.xlabel('Finger Length')

plt.subplot(1, 4, 4)
plt.scatter(df['# LETTERS'], df['HEIGHT'])
plt.xlabel('Number of Letters in Name')

plt.show()

In [None]:
# check correlation of each variable with height
correlation = df[['Age Decimal', 'HEIGHT', 'FOOT LENGTH', 'INDEX FINGER', '# LETTERS']].corr()

# Print the correlations
print(correlation['HEIGHT'].round(2))
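
Correlation is a number between -1 and 1: values near 1 or -1 mean the points fall close to a straight line, and values near 0 mean there is little linear relationship. Here is a small sketch with made-up numbers (not the class dataset); the column names `age`, `height`, and `letters` are assumptions for this example.

```python
import pandas as pd

# made-up example data, NOT the class dataset
toy = pd.DataFrame({
    'age':     [10, 11, 12, 13, 14],
    'height':  [140, 146, 152, 158, 164],  # rises by exactly 6 for each year
    'letters': [5, 3, 7, 4, 6],            # no clear pattern with height
})

# correlation of every column with height
toy_corr = toy.corr()['height']
print(toy_corr.round(2))
```

Here `age` has a correlation of 1 with `height` because the toy points lie exactly on a line, while `letters` has a much weaker correlation.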