# Unit 10 Live Session

Welcome to the first full class!


## Really boring file stuff that may one day save your...

The following cell throws an error while trying to read in a data file.  See if you can understand what the problem is.

In [69]:
import csv

with open('Nations.txt', 'r') as csv_file:
    file_reader = csv.reader(csv_file, delimiter = ' ')
    data = []
    for row in file_reader:
        data.append(row)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1859: invalid start byte

Sadly, this type of error is really common when you find a new source of data.

Notice that the error gives you the contents of a specific byte: 0x82.  It also reports a position, but that is an offset into a buffer (a temporary memory location) rather than into the file, so it's not very helpful.  Here's how you would create the byte in question:

In [79]:
byte = b'\x82'

Depending on which encoding we use, this byte can mean different things:

In [None]:
byte.decode('utf-8')

In [None]:
byte.decode('windows-1250')

In [None]:
byte.decode('macintosh')
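If you'd rather see all the interpretations side by side, here is a minimal sketch that tries several encodings in a loop and catches the failure instead of stopping.  The `cp437` entry (the old DOS code page) is an extra added here for comparison; it is not one of the encodings tried above.

```python
byte = b'\x82'

# Try each encoding in turn; record None when the byte cannot be decoded
results = {}
for encoding in ['utf-8', 'windows-1250', 'macintosh', 'cp437']:
    try:
        results[encoding] = byte.decode(encoding)
    except UnicodeDecodeError as err:
        results[encoding] = None
        print(encoding, 'fails:', err.reason)

# Show what the byte means under the encodings that accept it
for encoding, text in results.items():
    if text is not None:
        print(f'{encoding}: {text!r}')
```

The same byte comes out as a low quotation mark in windows-1250, as Ç in Mac Roman, and as é in cp437, which is why guessing the wrong encoding silently corrupts accented names.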

If you know what kind of machine the file was made on, there's a chance you can guess the encoding.  Sometimes the Unix `file` command can help.

In [None]:
!file Nations.txt
# -I (report MIME type and encoding) is the macOS spelling; on Linux the flag is -i
!file -I Nations.txt

There's also a Python library, `chardet`, which tries to help you determine what encoding you have.

In [None]:
import chardet
with open('Nations.txt', 'rb') as csv_file:
    print(chardet.detect(csv_file.read()))

Often, you just have to go into your file manually and find and correct the encoding errors.  It can help to call the `open` function with the `errors` argument set to `"replace"`.  This lets you read in the file, but flags the bytes that are causing trouble with a special replacement character.

In [None]:
with open('Nations.txt', 'r', errors = 'replace') as csv_file:
    file_reader = csv.reader(csv_file)
    for row in file_reader:
        print(row)
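To track down the trouble spots, it can help to work on the raw bytes.  The sketch below uses a few made-up bytes in place of the real file (the sample data is invented for illustration).  It finds the offset of each undecodable byte within the data, then lists the lines that pick up the replacement character U+FFFD when decoded leniently.

```python
# Made-up bytes standing in for a file's contents; in practice you might use
# raw = open('Nations.txt', 'rb').read()
raw = b'Cura\x82ao,1\nPer\xc3\xba,2\n'

# Byte offsets (within the whole data) of bytes that are not valid UTF-8
bad_offsets = []
pos = 0
while pos < len(raw):
    try:
        raw[pos:].decode('utf-8')
        break  # the rest of the data decodes cleanly
    except UnicodeDecodeError as err:
        bad_offsets.append(pos + err.start)
        pos += err.start + 1  # resume just past the offending byte

# Lines that contain the replacement character after a lenient decode
text = raw.decode('utf-8', errors='replace')
flagged = [(number, line)
           for number, line in enumerate(text.splitlines(), start=1)
           if '\ufffd' in line]

print(bad_offsets)
print(flagged)
```

Because the whole file is read in binary mode, the offsets here really are positions in the file, unlike the buffer position in the original error message.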

## Basic NumPy Operations

In [9]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

print(np.__version__)

1.15.2


The file scores.csv contains actual grades from a recent semester of Python (names have been stripped and rows permuted to prevent de-anonymization).

Examine the contents of the file.  Then read in the data and store it in two numpy arrays:  one called midterm and one called final.

Compute the mean score for the midterms and the mean score for the finals.

Compute how many students there are.

Create a boolean array called improving, which is True for students that score higher on the final than the midterm, and False for other students.

Next, look at **only** the students that did better on the final than the midterm, and find their mean final score.

Do the same for the remaining students - those that did worse on the final, or the same on the midterm and the final.

Compute how many students dropped 10 points or more from their midterm score to their final score.
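If boolean arrays are new to you, here is a warm-up sketch on five made-up scores (not the contents of scores.csv).  It shows the kind of comparison, masking, and counting these steps rely on; treat it as one possible approach rather than the expected answer.

```python
import numpy as np

# Invented scores for five students, for illustration only
midterm = np.array([70, 85, 90, 60, 75])
final = np.array([80, 85, 75, 65, 60])

n_students = midterm.size                  # number of students

# Elementwise comparison gives one True/False per student
improving = final > midterm

# A boolean array used as an index keeps only the True positions
mean_final_improving = final[improving].mean()
mean_final_rest = final[~improving].mean()  # ~ flips True and False

# True counts as 1 when summed, so this counts the big drops
n_big_drop = np.sum(midterm - final >= 10)

print(n_students, improving, mean_final_improving, mean_final_rest, n_big_drop)
```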

We become concerned that the average for the final is too low.  To correct this, we decide to randomly award some bonus points.  Specifically, for each student in the class, with probability 30%, we want to add a single bonus point to the final score.

First, please set the random seed by executing the instruction below.  This will ensure that different groups get the same result.

Second, generate an array of random numbers, each a uniform draw between zero and one.  (Check out np.random.rand)

You have to figure out what the next steps are!  Store your result in a new ndarray called curved_final (don't overwrite the original final scores).

In [None]:
np.random.seed(100)

# Your code here
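If you get stuck, here is a hedged sketch of one way the steps could fit together, again on made-up final scores.  The idea is that a uniform draw falls below 0.3 about 30% of the time, so comparing the draws to 0.3 gives a True/False bonus flag for each student.

```python
import numpy as np

np.random.seed(100)

# Invented final scores, for illustration only
final = np.array([80, 85, 75, 65, 60])

# One uniform draw in [0, 1) per student
draws = np.random.rand(final.size)

# True with probability 0.3 for each student
gets_bonus = draws < 0.3

# Adding a boolean array adds 1 where True and 0 where False;
# this builds a new array and leaves final untouched
curved_final = final + gets_bonus

print(gets_bonus)
print(curved_final)
```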

Run the instruction below to generate a scatterplot of midterm scores versus final scores.

In [None]:
plt.scatter(midterm, final)

Now place the score data into a two-dimensional array with two columns.

You want each row to represent a different student.  You want the 0th column to represent the midterm score and the 1st column to represent the final score.  Call your new array all_grades and check its shape to see if it's correct.

Use array indexing on all_grades (square brackets) to pull out the array of scores for the student at index 10.
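As a sketch of the idea, `np.column_stack` is one way to glue two 1-D arrays together as columns (`np.stack` with `axis=1` or a transpose would also work).  The twelve scores below are made up so that index 10 exists.

```python
import numpy as np

# Invented scores for twelve students, for illustration only
midterm = np.arange(60, 72)      # 60, 61, ..., 71
final = midterm + 5

# Each 1-D array becomes one column of the 2-D result
all_grades = np.column_stack((midterm, final))

print(all_grades.shape)   # (12, 2): one row per student, two columns
print(all_grades[10])     # [70 75]: both scores for the student at index 10
```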

## Linear Modeling in Numpy

You develop a linear model that predicts how a student will do on the final based on their midterm grade.  Your model is as follows:

$$final = w_1 midterm + w_2$$

You write this as the product of matrices:

$$final = \begin{bmatrix}
    midterm & 1
\end{bmatrix}
\begin{bmatrix}
    w_1 \\ w_2
\end{bmatrix}
$$

If you remember your matrix algebra, these two equations mean exactly the same thing.
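If your matrix algebra is rusty, expanding the product shows why: a row with two entries times a column with two entries is the sum of the entry-by-entry products.

$$\begin{bmatrix}
    midterm & 1
\end{bmatrix}
\begin{bmatrix}
    w_1 \\ w_2
\end{bmatrix}
= midterm \cdot w_1 + 1 \cdot w_2 = w_1 midterm + w_2$$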

You estimate $w_1 = 0.8$ and $w_2 = 5$.

Create a numpy array called w to hold the column of weights above ($w_1$ on top, $w_2$ below).  You will need to use reshape() to get it into the right dimensions.  Check w.shape to make sure it is (2,1).

Take the midterm score for the student at index 10 and turn it into a matrix that looks like `[midterm 1]`.  Make sure the shape of your matrix is (1,2).  (You may have to reshape it - this is important because numpy likes to drop dimensions when you're not looking, and that can lead to mistakes.)

Compute the matrix product of your new matrix with w (use np.dot).  This will give you the model prediction for the final grade for this student.
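Here is a minimal sketch of those shapes in action.  The midterm score of 70 is made up; with the real data you would pull the score out of your own array.

```python
import numpy as np

# The estimated weights as a column: shape (2, 1)
w = np.array([0.8, 5]).reshape(2, 1)

# A made-up midterm score, for illustration only
midterm_score = 70

# Without reshape this would be a 1-D array of shape (2,)
x = np.array([midterm_score, 1]).reshape(1, 2)

# (1, 2) times (2, 1) gives a (1, 1) matrix holding the prediction
prediction = np.dot(x, w)

print(x.shape, w.shape, prediction.shape)
print(prediction)
```

The result is still a 2-D array, so `prediction[0, 0]` is what pulls out the plain number.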

Next, create a matrix to hold all the midterm scores with 1's as follows.  Call it all_midterm

```
| midterm_1  1|
| midterm_2  1|
|   ...    ...|
| midterm_n  1|
```

Find the matrix product of all_midterm with w.  If you think about how matrix multiplication works, this will give you an array of predicted final grades, one for each student.  Store the result in a variable called predicted.

Find the maximum overall grade predicted by your model.
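The same pattern scales up to the whole class.  In this sketch, four made-up midterm scores are paired with a column of ones, so a single matrix product predicts every final grade at once.

```python
import numpy as np

# Invented midterm scores, for illustration only
midterm = np.array([50, 60, 70, 90])
w = np.array([0.8, 5]).reshape(2, 1)

# One row per student: [midterm_i, 1]
all_midterm = np.column_stack((midterm, np.ones(midterm.size)))

# (n, 2) times (2, 1) gives (n, 1): one prediction per student
predicted = np.dot(all_midterm, w)

print(all_midterm.shape, predicted.shape)
print(predicted.max())   # largest predicted final grade
```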

## Loss Functions in Numpy

The idea of a loss function is central to modern machine learning, and one of the main reasons ML came roaring back after the "AI winter" of 1987-93.

In the context of a linear regression, statisticians most often use the following loss function.

$$loss = \sum (\text{predicted_y} - \text{actual_y})^2 $$

Write a function that takes the model coefficients, w_1 and w_2, and computes the loss function that results from that model.

In [None]:
def loss(w_1, w_2, all_midterm, final):
    return 0
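If you want something to compare against, here is a hedged sketch of one way to fill in the function, under the assumption that all_midterm has shape (n, 2) and final has shape (n,).  The name `squared_error_loss` and the three data points are made up for illustration.  Watch the shapes: subtracting an (n,) array from an (n, 1) array broadcasts to (n, n) and quietly gives the wrong sum, so the predictions are flattened first.

```python
import numpy as np

def squared_error_loss(w_1, w_2, all_midterm, final):
    """Sum of squared differences between predicted and actual finals."""
    w = np.array([w_1, w_2]).reshape(2, 1)
    # ravel() turns the (n, 1) predictions into shape (n,) to match final
    predicted = np.dot(all_midterm, w).ravel()
    return np.sum((predicted - final) ** 2)

# Invented data, for illustration only
midterm = np.array([50, 60, 70])
final = np.array([45, 53, 62])
all_midterm = np.column_stack((midterm, np.ones(midterm.size)))

# Predictions are 45, 53, 61, so only the last student contributes
print(squared_error_loss(0.8, 5, all_midterm, final))
```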

In [99]:
import re

w1 = np.array([])
w2 = np.array([])
cost = np.array([])



In [119]:
coefs = re.findall(r'-?\d+\.?\d*', input('enter a pair of coefficients to try'))
coefs = np.array(coefs).astype('float64')
print("adding coefficients:", coefs)
print(coefs[0])
w1 = np.append(w1, coefs[0])
w2 = np.append(w2, coefs[1])
print(w1, w2)
cost = np.append(cost, loss(coefs[0], coefs[1], all_midterm, final))

plt.scatter(x=w1, y=w2, c=cost, cmap='copper')
plt.colorbar()

enter a pair of coefficients to try 1 3


adding coefficients: [1. 3.]
1.0
[2. 1.] [4. 3.]
