# Lecture 1

**Welcome to the first CX Lecture!** Here I'll be walking through some preparatory steps to onboard you into the world of data science and give you a glimpse of just how powerful these tools can be.

# But first, imports

Data science is built on the backs of a million different packages and tools, which you will likely become extremely familiar with and hopefully contribute to yourself in the future! Here's how we import them:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy! (does not rhyme with lumpy)

<img src="numpy.png" alt="Drawing" style="width: 500px; height: 250px"/>

An introduction to your best friend in the realm of data science. NumPy is an optimized math library for Python. The code is vectorized as much as possible, which means that there's a heavy focus on operating on whole arrays (treated as n-dimensional vectors) at once. This is a shift away from looping over elements one at a time, and it is much, much, **much** faster under the hood.

For example:
If you wanted to compute the **dot product** of 2 arrays, `[1, 2, 3, 4, 5]` and `[5, 4, 3, 2, 1]` = `1*5 + 2*4 + 3*3 + 4*2 + 5*1`, you could either loop through two **lists** in Python

```python
total = 0  # named `total` so we don't shadow Python's built-in `sum`
for v1, v2 in zip(arr1, arr2):  # iterates through the lists at the same time
    total += v1 * v2
```

**Or**, you could perform all the multiplications at once, and then add them together. That's basically what NumPy does behind the scenes. So doing the dot product in NumPy is very simple:

```python
dot_product = arr1.dot(arr2)
```
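To make the comparison concrete, here is a small self-contained sketch of both approaches on the two example arrays. The variable names (`list1`, `loop_total`, and so on) are just illustrative choices.

```python
import numpy as np

# The two example arrays from above, as plain Python lists.
list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]

# Loop version: multiply pairwise and accumulate.
loop_total = 0
for v1, v2 in zip(list1, list2):
    loop_total += v1 * v2

# NumPy version: convert to arrays, then use the vectorized dot product.
arr1 = np.array(list1)
arr2 = np.array(list2)
dot_product = arr1.dot(arr2)

print(loop_total, dot_product)  # both are 35
```

Both give the same answer; the difference only starts to matter for speed once the arrays get large.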

Unless you take 61C, you'll never need to know just what kind of magic is going on behind the scenes (numpy is basically just C in stilts and a big trench coat), but we can all benefit from how useful it is!

Before we begin: a vector is a one-dimensional array of numbers, and a matrix is a two-dimensional one. In NumPy, we represent both with `np.array`, as we did above. There is also `np.matrix`, but it is always 2-dimensional and harder to manipulate, and NumPy's documentation no longer recommends it, so we'll stick with arrays.

<img src="matrices_vectors.png" alt="Drawing" style="width: 800px"/>

Linear algebra is **absolutely critical** to the world of data science as a whole, and while you won't need to nail these concepts fully until you take Math 54 later on in your data science journey, it's never too early to start! The representational versatility and mathematical adaptability of matrices make them an unbeatable data structure for representing whatever you need, and some matrix operations have very interesting, unexpected applications which you'll see down the line.

In [9]:
v = np.array([1, 2, 3, 4, 5]) #creating an array
v

array([1, 2, 3, 4, 5])

In [10]:
v[2] #Indexing into it (start from 0!)

3

**Exercise**: How would you sum the 3rd and 4th elements of v?

In [12]:
v[2] + v[3]

7

### Indexing 2-D Arrays in Numpy

What is a 2-D array? It's an array of arrays, aka a matrix. 

This is what a 2D list looks like in vanilla Python.
```python
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    ]
```

However, accessing a single number such as 6 is a little clunky. A nested list has no built-in way to index 2 layers deep in one step, so you have to index into one list at a time, as follows:

```python
# getting the number 6 from A
# A[1] = [4, 5, 6]
# A[1][2] = 6
six = A[1][2]
```

When you store an array as an np.array, you are not only gaining a runtime speedup, you're also getting a speedup in writing your code because you now have advanced indexing!

Now, we'll show how to index in a similar array in numpy's array format. You can find more info in greater detail [here](https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.indexing.html).
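As a quick sketch, here is the same 3x3 grid stored as a NumPy array, indexed with a single pair of brackets. The names `six`, `middle_column`, and `top_right` are only for illustration.

```python
import numpy as np

# The same 3x3 grid as the nested list above, stored as a NumPy array.
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

six = A[1, 2]            # row 1, column 2 in a single indexing step
middle_column = A[:, 1]  # every row, column 1
top_right = A[0:2, 1:3]  # rows 0-1, columns 1-2 (the end index is excluded)

print(six)
print(middle_column)
print(top_right)
```

The `row, column` form also accepts slices, which is what makes grabbing whole columns or sub-blocks so short to write.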

### Subarrays and Submatrices

_`np.random.randn` takes the size of each dimension as its arguments. With 2 arguments, the 1st is the number of rows and the 2nd is the number of columns. It creates a matrix of those dimensions filled with random numbers drawn from the standard normal distribution._

In [14]:
random = np.random.randn(3, 4)
random

array([[-1.00157856,  0.04807301, -0.05234902,  0.7060537 ],
       [ 0.33759888, -0.13609568, -0.29128974, -1.55680288],
       [-0.71510957,  0.66373481,  0.66470032, -0.11723992]])

In [15]:
bigger_random = random * 10
bigger_random

array([[-10.01578564,   0.48073014,  -0.52349022,   7.06053698],
       [  3.37598882,  -1.36095685,  -2.91289738, -15.56802882],
       [ -7.15109568,   6.63734815,   6.64700322,  -1.17239919]])

That's better, but there are still a lot of decimals. Let's go 1 step further and convert all the values to integers with the `astype` method, which casts an array to a specified type. Note that casting to `int` does not round to the nearest integer: it drops the fractional part, so `-0.52` becomes `0` and `6.64` becomes `6`.

This is an example of how easy it is to apply a function to every element in a matrix.

In [16]:
A = bigger_random.astype(int)
A

array([[-10,   0,   0,   7],
       [  3,  -1,  -2, -15],
       [ -7,   6,   6,  -1]])
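If you do want nearest-integer rounding, one option is to call `np.round` before casting. Here's a small sketch with made-up values showing how the two differ: casting with `astype(int)` drops the fractional part, while `np.round` goes to the nearest whole number.

```python
import numpy as np

values = np.array([-0.52, 0.48, 6.64, -2.91])

truncated = values.astype(int)          # drops the fractional part (toward zero)
rounded = np.round(values).astype(int)  # rounds to nearest, then casts

print(truncated)
print(rounded)
```

Compare the first and third entries of the two results to see where they disagree.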

**2D Indexing**

In [20]:
A[2, 2]


6

## Array Manipulation & Broadcasting

Here's an example of element-wise multiplication of two equal-length sequences in vanilla Python vs. NumPy. Multiplying two plain lists raises a `TypeError`, while NumPy arrays multiply element by element. Nifty, right?

In [36]:
[1, 2, 3] * [2, 5, 6]

In [20]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b

array([2., 4., 6.])

The most important thing NumPy does is **broadcasting**, which means that it allows for arithmetic operations on arrays of different shapes.

It's important because it uses less memory and is more computationally efficient: broadcasting lets NumPy avoid building a full-size copy of the smaller operand, so less memory is moved around during the operation (in the example below, `a` is a scalar rather than an array).


More information can be found [here](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). (yay, **documentation!**)



In [21]:
b = np.array([1.0, 2.0, 3.0])
a = 2.0
a * b

array([2., 4., 6.])

In [22]:
b ** 2

array([1., 4., 9.])

In [23]:
b + 42

array([43., 44., 45.])

The rule of thumb is that NumPy does arithmetic operations pairwise, but if a certain dimension is 1, then it will **broadcast** that effect across the dimension. Broadcasting is when a smaller array is "repeated" across a larger array so they have compatible shapes, and arithmetic can be done between them.

Here's a more complicated example.

In [42]:
b # 1-D array of length 3 (behaves like a 1x3 row when broadcast)

array([1., 2., 3.])

In [44]:
a = np.zeros((3,3)) #full matrix of just 0, 3x3
a + b

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

The way this works is that we are adding the `b` row vector to every row of the matrix `a`. In effect, "stretching" `b` across `a`.
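Broadcasting can stretch both operands at once. As a sketch with made-up numbers, adding a column of shape `(3, 1)` to a row of shape `(3,)` produces a full `(3, 3)` grid:

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0])              # shape (3,)
column = np.array([[10.0], [20.0], [30.0]])  # shape (3, 1)

# (3, 1) and (3,) broadcast to (3, 3): `column` is stretched across the
# columns and `row` is stretched down the rows.
grid = column + row

print(grid.shape)
print(grid)
```

Each entry of `grid` is one element of `column` plus one element of `row`, without either array ever being copied out to full size by hand.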

## Aggregation and Axes Operations

NumPy is also great at **aggregation**, which means *combining* values along rows or columns in arrays or matrices.

In [26]:
big_matrix = np.arange(16).reshape((4, 4)) #this creates an array from 0 to 15, and then reshapes it into a 4x4 matrix
big_matrix
test_matrix = np.zeros(10)
print(test_matrix)
reshape_test = test_matrix.reshape((2,5))
print("Reshaped:", reshape_test)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Reshaped: [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


In [46]:
big_matrix.sum() # sums all of the elements in an array/matrix

120

The `axis` parameter is commonly used in NumPy.

When you pass in `axis=0`, the operation runs down the rows and collapses them, giving one result per column. With `axis=1`, it runs across the columns, giving one result per row.

Let's go back to our `big_matrix`.

But now, instead of the total sum of all the elements, we want the sum of each row (the row sums).

In [47]:
big_matrix


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [51]:
big_matrix.sum(axis=1)

In [50]:
#and the columns!
big_matrix.sum(axis=0)
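Here is a self-contained sketch of both calls on the same 4x4 matrix, with the results stored in (illustratively named) variables so you can check which direction each axis collapses:

```python
import numpy as np

big_matrix = np.arange(16).reshape((4, 4))

row_sums = big_matrix.sum(axis=1)     # collapse the columns: one total per row
column_sums = big_matrix.sum(axis=0)  # collapse the rows: one total per column

print(row_sums)     # [ 6 22 38 54]
print(column_sums)  # [24 28 32 36]
```

A handy sanity check: the first row is `0 + 1 + 2 + 3 = 6`, and the first column is `0 + 4 + 8 + 12 = 24`.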

Other functions such as `np.mean` also have axis arguments.

In [52]:
big_matrix.mean(axis=1)

array([ 1.5,  5.5,  9.5, 13.5])

In [54]:
big_matrix.mean(axis=0) #these are aggregation functions!

array([6., 7., 8., 9.])

**Exercise**: How would I return the **row** with the largest summed value?
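Try it yourself first! If you get stuck, here is one possible approach (certainly not the only one), assuming the same `big_matrix` as above and using `argmax` to find the position of the largest row sum:

```python
import numpy as np

big_matrix = np.arange(16).reshape((4, 4))

row_sums = big_matrix.sum(axis=1)    # one total per row
best_row_index = row_sums.argmax()   # position of the largest total
best_row = big_matrix[best_row_index]

print(best_row)
```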

## Conditions

Now we're going to see how we can select certain elements based on conditions that we specify. Sometimes you don't want all the rows and columns from a matrix you're given. For example, say we have the following array of numbers, and we want the first and last number.

In [55]:
random_array = np.arange(3, 13, 2)
random_array

array([ 3,  5,  7,  9, 11])

One way to do this is to use boolean indexing, where you put a `True` for the ones you want, and a `False` for the ones you don't, like this:

In [56]:
random_array[[True, False, False, False, True]]

array([ 3, 11])

But that isn't always feasible. Let's look at another example, where we have the first 25 Fibonacci numbers.

In [57]:
# This code generates the first 25 elements of the Fibonacci sequence (a series of
# numbers in which each number is the sum of the two preceding numbers)
# It's a cool exercise to figure out how this works, try it out at home!

A = np.array([
    [1, 1],
    [1, 0]
])

fib = np.zeros(25)

start = np.array([1, 0])
curr_A = A

fib[0] = 0
fib[1] = 1

for i in np.arange(2, 25):
    fib[i] = (curr_A @ start)[0]
    curr_A = A @ curr_A

fib = fib.astype(int)
fib

array([    0,     1,     1,     2,     3,     5,     8,    13,    21,
          34,    55,    89,   144,   233,   377,   610,   987,  1597,
        2584,  4181,  6765, 10946, 17711, 28657, 46368])

In [58]:
fib[[True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True]]

array([    0,     2,     8,    34,   144,   610,  2584, 10946, 46368])

But this requires you to manually look through for the ones you want, and this can take a long time. How can we tell if a number is even? Let's see what the 2 operations below yield.

In [59]:
4 % 2, 3 % 2 # 4 is even, 3 is not

(0, 1)

It turns out that the modulo operator, `x % y`, gives the remainder when `x` is divided by `y`, so an even number has a remainder of 0 when divided by 2. We can apply the modulo operator to an entire array in the same way, like so:

In [64]:
fib % 2 == 0

array([ True, False, False,  True, False, False,  True, False, False,
        True, False, False,  True, False, False,  True, False, False,
        True, False, False,  True, False, False,  True])

**How might we use the boolean array generated here to get what we want?**

array([    0,     2,     8,    34,   144,   610,  2584, 10946, 46368])
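One way to get that result is to pass the boolean array straight into the brackets. The sketch below rebuilds the sequence with a plain loop (a simpler stand-in for the matrix-based construction above) so that it runs on its own:

```python
import numpy as np

# Build the first 25 Fibonacci numbers with a plain loop.
fib_list = [0, 1]
while len(fib_list) < 25:
    fib_list.append(fib_list[-1] + fib_list[-2])
fib = np.array(fib_list)

is_even = fib % 2 == 0   # boolean array, True where the value is even
even_fib = fib[is_even]  # keep only the positions marked True

print(even_fib)
```

In practice you would usually skip the intermediate variable and write `fib[fib % 2 == 0]` directly.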

## Useful NumPy Functions

In [65]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [66]:
np.ones((4, 4)) #takes in a tuple specifying the number of rows and columns

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [67]:
np.zeros((5, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [118]:
np.eye(5) #Anyone know why this might be useful?

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
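One reason `np.eye` is handy: it builds the identity matrix, which plays the same role for matrix multiplication that the number 1 plays for ordinary multiplication. A minimal sketch, using an arbitrary 3x3 matrix `M`:

```python
import numpy as np

M = np.arange(9).reshape((3, 3))
I = np.eye(3)

# Multiplying by the identity leaves the matrix unchanged, on either side.
print(np.array_equal(M @ I, M))
print(np.array_equal(I @ M, M))
```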

In [69]:
np.dot(np.array([1,2]), np.array([3,4]))

11

In [70]:
np.full((2,2),7) 

array([[7, 7],
       [7, 7]])


# An Introduction to Pandas and Data Processing

In [71]:
import pandas as pd

Well, we know we can store numbers in matrices in NumPy. But this isn't great: compare and contrast with Microsoft Excel. NumPy seems like Excel without any of its nice aesthetic features, like plotting graphs, etc. **Pandas** is Python's answer to this.

**NOTE:** Today, we will only be going through a handful of useful Pandas functions. To explore all of Pandas' functionality in more depth, see the full documentation here: https://pandas.pydata.org/pandas-docs/stable/

Today, we'll be diving into the **Titanic** dataset, which has the data for every passenger aboard the Titanic. We've downloaded two .csv files for you to play with in Pandas.

A **Comma-Separated Values** (.csv) file is a plain-text relative of a Microsoft Excel spreadsheet (Excel can open and export it): each row is its own line, and the values within a row are separated by commas.

Pandas allows you to convert a .csv file into a Pandas object in the following way.

<img src="matrix_df.png" alt="Drawing" style="width: 600px; height: 150px"/>

In [75]:
titanic_train = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')

Data is stored in **DataFrame** objects. These might look familiar because they're laid out just like matrices! Take a look:

In [76]:
titanic_train.head(10) #use head function for a peek

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [77]:
type(titanic_train['Name'])

pandas.core.series.Series

Each of the columns is a **Series** object, and you can get each of them by indexing the same way as you would a **dictionary** in Python (in brackets).

In [87]:
titanic_train['Name'].head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

## Indexing & Slicing

In [78]:
titanic_train.loc[2, 'Name'] # gets the name of the passenger with index 2

'Heikkinen, Miss. Laina'

The **Index** is like a special column in our dataframe that we use to uniquely identify our rows. Anything can be an index as long as each row has a unique value in this column.
Now, since each row represents a person aboard, it makes sense to use `PassengerId` as the index. It also makes our `.loc` calls read more naturally, e.g. getting the name of the passenger with a given `PassengerId`.

We accomplish this with the `.set_index` command.

In [79]:
titanic_train = titanic_train.set_index('PassengerId')
titanic_train.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Congratulations, you are now data scientists! You just did one step of what is known as **exploratory data analysis (EDA)**.

In [80]:
#The df -> matrix analogy is more than just an analogy, we can convert back and forth

titanic_matrix = titanic_train.values
print(type(titanic_matrix))
titanic_matrix


<class 'numpy.ndarray'>


array([[0, 3, 'Braund, Mr. Owen Harris', ..., 7.25, nan, 'S'],
       [1, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', ...,
        71.2833, 'C85', 'C'],
       [1, 3, 'Heikkinen, Miss. Laina', ..., 7.925, nan, 'S'],
       ...,
       [0, 3, 'Johnston, Miss. Catherine Helen "Carrie"', ..., 23.45,
        nan, 'S'],
       [1, 1, 'Behr, Mr. Karl Howell', ..., 30.0, 'C148', 'C'],
       [0, 3, 'Dooley, Mr. Patrick', ..., 7.75, nan, 'Q']], dtype=object)


We can accomplish most of what we can do in NumPy in Pandas as well. For example, we can index into a **DataFrame** by position with `.iloc`.

In [82]:
titanic_train.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [83]:
titanic_train.iloc[3, 3] # 4th row, 4th column ==> sex of 4th passenger.

'female'

We can use iloc to grab multiple rows and columns too. Instead of indexing, this is called **slicing** our dataframe:  
`df.iloc[s_row_idx:e_row_idx, s_col_idx:e_col_idx]`.

In [84]:
titanic_train.iloc[3:5, 2:5] # rows at positions 3 and 4, columns at positions 2, 3, and 4 (zero-indexed; the end index is excluded)

| PassengerId | Name | Sex | Age |
| --- | --- | --- | --- |
| 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
| 5 | Allen, Mr. William Henry | male | 35.0 |
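The difference between `.loc` (labels) and `.iloc` (positions) is easy to mix up once the index is no longer 0, 1, 2, and so on. Here's a small sketch on a tiny made-up frame; the ids, names, and ages are hypothetical and only there for illustration.

```python
import pandas as pd

# A tiny made-up frame; the ids and names are hypothetical.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [30, 41, 25]},
    index=pd.Index([101, 102, 103], name="PassengerId"),
)

by_label = df.loc[102, "Name"]  # label-based: the row labeled 102
by_position = df.iloc[1, 0]     # position-based: second row, first column
subset = df.iloc[0:2, 0:2]      # first two rows, first two columns

print(by_label, by_position)
print(subset)
```

Both lookups land on the same cell here, but one asks for it by name and the other by where it sits in the table.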


## Applying functions

We can call a function on each element of one of our **Series** objects with the `.apply` method.

For **DataFrames**, `.apply` passes whole columns (`axis=0`) or whole rows (`axis=1`) to the function, so you can choose which direction to apply it in.

In [86]:
titanic_train.apply(len, axis = 0)

Survived    891
Pclass      891
Name        891
Sex         891
Age         891
SibSp       891
Parch       891
Ticket      891
Fare        891
Cabin       891
Embarked    891
dtype: int64

In [88]:
titanic_train['Name'].apply(len)

PassengerId
1      23
2      51
3      22
4      44
5      24
       ..
887    21
888    28
889    40
890    21
891    19
Name: Name, Length: 891, dtype: int64
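`.apply` works with your own functions too, not just built-ins like `len`. As a sketch, here is a hypothetical helper, `last_name`, applied to a two-element Series of names taken from the table above:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

def last_name(full_name):
    # The surname is everything before the first comma.
    return full_name.split(",")[0]

last_names = names.apply(last_name)
print(last_names.tolist())
```

The function is called once per element, and the results come back as a new Series with the same index.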

## Filtering & Sorting

We can use conditionals to index into our dataframe too. This will return a new dataframe, containing only the rows which meet the condition. This is called **filtering** your dataframe. 

In [90]:
survived = titanic_train[titanic_train['Survived'] == 1] # all passengers that survived.
survived.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


Under the hood, you can think of filtering as **boolean indexing** into your dataframe. What does that mean? Well, `titanic_train['Survived'] == 1` actually creates a Boolean array of `True` and `False` values - one for each row in the dataframe, indicating whether the `Survived` column was 1 or not in that row. We then use that array of `True` and `False` values as our way to index into the dataframe, grabbing only the rows which have a `True` associated with them. You can see this here:

In [91]:
arr = titanic_train['Survived'] == 1 #boolean array
print(arr[0:5])

PassengerId
1    False
2     True
3     True
4     True
5    False
Name: Survived, dtype: bool
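Boolean masks can also be combined. The sketch below uses a few made-up rows that mimic Titanic columns (the values are not taken from the real dataset) and keeps only the rows where both conditions hold:

```python
import pandas as pd

# Made-up rows that mimic a few Titanic columns.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "female"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

survived_mask = df["Survived"] == 1  # boolean Series, one entry per row
female_mask = df["Sex"] == "female"

# Combine masks with & (and) or | (or); wrap each comparison in parentheses
# when writing them inline.
female_survivors = df[survived_mask & female_mask]
print(female_survivors)
```

Note that Python's plain `and` / `or` keywords don't work element-wise on Series, which is why `&` and `|` are used here.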


In [92]:
titanic_train['Age'].sum() / titanic_train['Age'].dropna().shape[0] # the average age of someone aboard.

29.69911764705882
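The manual recipe above works because `sum()` skips missing values, so dividing by the count of non-missing ages gives the mean of the known ages. Pandas' built-in `mean()` skips `NaN` by default too, so the two should agree. A quick sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0])

manual_mean = ages.sum() / ages.dropna().shape[0]  # same recipe as above
builtin_mean = ages.mean()                         # skips NaN by default

print(manual_mean, builtin_mean)
```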

**Exercise**: Find the total fare spent by women aboard the Titanic.

**Sorting** is also very useful when dealing with large amounts of data. For example, what if we wanted to sort our Titanic dataframe so that the passengers who spent the most on their tickets appear first? This requires us to sort by the `Fare` column of the dataframe.

In [95]:
titanic_train.sort_values("Fare", ascending=False).head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
| 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |


# Linear Algebra

Linear algebra is the bedrock of data science, both as an academic discipline and as a professional field. Anyone familiar with it knows how heavily it deals with matrices and ways to manipulate them, and many of the advanced techniques that appear deep into a linear algebra course are highly applicable to working with pandas DataFrames and similar structures.

Without actually teaching you linear algebra it's a little hard to show you how cool some of these techniques are, but I can try!

**One-Hot Encoding**

We cleaned the categorical variables earlier, but they're still non-numeric, and for our models, we need to make all of our data numeric. To convert categorical variables to numeric values, we can use a method called one-hot encoding. Basically, we will make a new column for each category and set a flag of 1 or 0 – 1 if that observation is in that category, and 0 if it's not.

We basically want to achieve something that looks like this:



<img src="one_hot_1.png" alt="Drawing" style="width: 600px; height: 200px"/>

How do we do that? Luckily, pandas has a built-in function called `get_dummies()` that will do the one-hot encoding for us! One-hot encoding is also called dummy encoding.



In [98]:
one_hot = pd.get_dummies(titanic_train, columns=['Embarked']).head()
one_hot

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_C | Embarked_Q | Embarked_S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | 0 | 0 | 1 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 0 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | 0 | 0 | 1 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | 0 | 0 | 1 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | 0 | 0 | 1 |


In [117]:
one_hot.iloc[:, 10:13]

| PassengerId | Embarked_C | Embarked_Q | Embarked_S |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |


This matrix is now *entirely numerical*, which means it can be manipulated with linear algebra techniques such as SVD (singular value decomposition), which helps us find the most useful components.

What are some potential problems that one-hot encoding will create? (Hint: Think about extreme cases.)
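To explore one such extreme case yourself, here is a small sketch with a hypothetical column in which every value is different (the ticket codes below are made up). Look at how the shape changes after encoding:

```python
import pandas as pd

# Hypothetical ticket codes: every value is different.
df = pd.DataFrame({"Ticket": ["A1", "B2", "C3", "D4", "E5"]})

encoded = pd.get_dummies(df, columns=["Ticket"])

# One new column per distinct value, so the frame grows as wide as it is long.
print(encoded.shape)
```

Now imagine doing the same thing to a column like `Name` or `Ticket` in the full dataset.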


For anyone interested in learning more about linear algebra without waiting to take Math 54, check out the 3Blue1Brown "Essence of Linear Algebra" series. It's one of the best introductions I've seen to any topic ever (frankly, you should watch it even while taking the class, it's fantastic):

https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

# How to try this on your own machine

It's really important to be able to play around with data and experiment with different EDA/cleaning techniques while you move through the stages of your data science journey. To that end, here's how you can open up a blank Jupyter notebook without needing to go to Berkeley's DataHub or another online host.

1. Install pip, a package manager that will save you dozens of hours in college alone, by running these commands in your terminal:

    `curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py`

    `python3 get-pip.py`
    
2. Use pip to install jupyter lab:

    `pip3 install jupyterlab`
    
3. After this, simply typing `jupyter lab` into your terminal should open JupyterLab in your browser, where you can create an entirely blank Jupyter notebook: that's how I made this one!

# 4 Groups! Vitamin!

<img src="frame.png" alt="Drawing" style="width: 300px; height: 300px"/>