<h1>
<center>
Module 1: First look at text-based classification
</center>
</h1>


<p>The first real problem I'd like to look at in the course is classifying tweets as carrying fake news (or not). But before getting to that in later modules, we need to pick up skills in what is called data wrangling and feature engineering. We will do that in this module. I am going to use a standard tutorial-type data set for machine learning: the passenger record of the Titanic steamship. The Titanic sank on its maiden voyage. We have the record of the passengers. We will do a practice problem of predicting who survived and who perished based solely on their name. Will this be effective? Seems kind of like reading Tarot cards. But let's keep an open mind. Maybe it will work.
<p>
Many text-based machine-learning problems contain their data in spreadsheet form. Python has a powerful library for dealing with spreadsheets called pandas. In this module we will use a handful of features from the *`pandas`* library. I'll go through some basic clean-up steps using pandas. Common wisdom is that the clean-up process can take up to 70% of your entire effort. Life is messy. Text data comes to us in unstructured forms. We have to deal with it.


<hr>
<h1>
Read in spreadsheet
</h1>


For the first part of the course, we will be working on a problem called classification. The data we will be using to make classifications will be in spreadsheet form (I'll also call this *table* form).

We could read in the data to our own custom Python data-structure. Instead we will use the pandas library to store our data and modify it.

I am going to use something called comma-separated values, or csv, as my raw file format. I like csv because you can use it to pass data around easily between things like Excel and Google Sheets. And pandas knows how to read raw csv format and produce its own version called a DataFrame. Our week 2 goal is to read a table of tweets, in csv form, and classify them as fake-news or not.
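
Here is a minimal sketch of what "read raw csv and produce a DataFrame" means. The csv text is typed in by hand and held in memory (an assumption made so the example runs without any file or url), and `toy_table` is just an illustrative name.

```python
import io
import pandas as pd

# A tiny hand-typed csv held in memory as text (not the real Titanic file).
csv_text = '''Name,Survived
"Braund, Mr. Owen Harris",0
"Heikkinen, Miss. Laina",1
'''

# read_csv accepts a file-like object as well as a path or a url.
toy_table = pd.read_csv(io.StringIO(csv_text))

print(toy_table.shape)          # (number of rows, number of columns)
print(list(toy_table.columns))  # the header line becomes the column names
```

Note that the quotes around each name keep the comma inside the name from being read as a column separator.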

Caveat: I said we are interested in classification (e.g., fake-news or not) but I'll use the term `prediction` for the Titanic. You can treat classification and prediction as interchangeable for now. I could say I am trying to `predict` who will survive or I could say I am trying to `classify` passengers into survivors and non-survivors. We will use the same methods for each.

I have the Titanic data stored on Google Sheets. I used Sheets to give me a url to the csv version of the file. Once I have that url, I can hand it to pandas and suck it in. Pretty dang cool. You all have access to Google Sheets so you can do the same. If you have data in spreadsheet form, upload it to Sheets and then get the url. Now anyone can access your spreadsheet.

BTW: it is convention to alias pandas as `pd`. It is also convention to use `df` as an abstract name for a DataFrame - you will see this in docs and StackOverflow. I am using `titanic_table` in place of `df` to give it more meaning.



In [0]:
import pandas as pd

url = 'https://docs.google.com/spreadsheets/d/1z1ycUZjJpmMWB4gXbhwRQ9B_qa42CwzAQkf82mLibxI/pub?output=csv'
titanic_table = pd.read_csv(url)
len(titanic_table)

891

In [0]:
#I am setting the option to see all the columns of our table as we build it, i.e., it has no max.
pd.set_option('display.max_columns', None)

In [0]:
titanic_table.head()  #shows first 5 rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<h2>Google Colab</h2>

I will run all my notebooks through Google Colab. So I assume you downloaded this notebook from Canvas and then uploaded it to your Colab account.

<hr>
<h1>
Explore
</h1>

We now have the 891 passengers in 891 rows of a table. We can use pandas methods to look a little more deeply at the data.

* Use `head()` to get the general layout. We did that above.
* Find which columns have `NaN` (empties) and how many.
* Use the `describe` method to spot odd-looking columns, e.g., more than 2 unique values for a binary column.
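
As a small sketch of what an "odd looking" binary column might look like, here is a made-up column with a deliberate typo in it. The data and the name `toy` are assumptions for illustration only; the real `Sex` column turns out to be clean.

```python
import pandas as pd

# Made-up column with a deliberate typo ('Male') in an otherwise binary column.
toy = pd.DataFrame({'Sex': ['male', 'female', 'Male', 'female']})

n_unique = toy['Sex'].nunique()  # number of distinct values in the column
print(n_unique)                  # more than 2 signals a problem for a binary column
print(toy['Sex'].value_counts()) # how often each distinct value occurs
```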


In [0]:
titanic_table.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Sheerlinck, Mr. Jan Baptist",male,,,,347082.0,,G6,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


<div class=just_text>

There is a mixture of column types. Some have discrete values (e.g., `Pclass`, `Sex`, `Embarked`), some have continuous values (e.g., `Age`, `Fare`), and some are in between (e.g., `SibSp`, `Parch`). The `Name` column has text values. The `Ticket` and `Cabin` columns are a bit of a hodgepodge and will take further wrangling to make them useful.

Note that a `NaN` has several meanings. In the table above, it means "does not apply". For instance, there is no std for the `Name` column, so that cell shows a `NaN`. More typically, a `NaN` will appear as a value in a table to stand for "empty - no known value". One more thing to note about it: it is not a string but a special floating-point value. So a comparison like `NaN == "NaN"` will be false; in fact, even `NaN == NaN` is false. You will have to use special pandas functions, such as `isna`, for dealing with a `NaN`.
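
A quick sketch of why an ordinary comparison does not work for finding a `NaN`. I am assuming here that the `NaN` in question is NumPy's `nan`, which is what pandas typically uses for empties in floating-point columns; the `ages` values are made up.

```python
import numpy as np
import pandas as pd

nan = np.nan  # the floating-point "not a number" value

print(nan == "NaN")  # False: NaN is not the string "NaN"
print(nan == nan)    # False: NaN does not even equal itself
print(pd.isna(nan))  # True: isna is the reliable test

ages = pd.Series([22.0, np.nan, 26.0])  # made-up values with one empty
n_empty = ages.isna().sum()             # counts the empties
print(n_empty)
```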

Let's next see how many empties there are in each column.
</div>

In [0]:
titanic_table.isna().sum()  #note use of isna to find the NaNs.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


* The `Age` column is a bit worrisome. It looks like a column that can be useful in prediction but has 177 empty values.

* The `Cabin` column has a lot of empties. I am dubious that the column as a whole will be useful. However, it might make sense to use an empty/non-empty question. For instance, maybe passengers with non-empty cabins were more likely to survive.

* The `Embarked` column has only 2 empties and that seems like something we can fill in.
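
The two ideas in the last two bullets can be sketched on a made-up table. Only the column names `Cabin` and `Embarked` come from the Titanic data; the rows, the new column name `HasCabin`, and the choice to fill with the most common port are all assumptions for illustration, not necessarily what we will do later.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the Titanic table.
toy = pd.DataFrame({
    'Cabin':    ['C85', np.nan, np.nan, 'C123'],
    'Embarked': ['C', 'S', np.nan, 'S'],
})

# Empty/non-empty question: 1 if a cabin is recorded, 0 if it is empty.
toy['HasCabin'] = toy['Cabin'].notna().astype(int)

# One possible fill for the empties: the most common port in the column.
most_common_port = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common_port)

print(toy)
```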


<hr>
<h1>Filter out unneeded columns</h1>
<p>
<div class=h1_cell>
<p>

I am really only interested in the `Name` column and the `Survived` column. Since we are trying to predict Survived values, it is known as the target column or label column or just plain y. The other columns are called features or xi. I am saying that we will only be interested in Name so it is the sole feature (for now).
<p>
My goal is to create a new table with just those 2 columns. There are 2 ways to go: (1) drop all the other columns, (2) copy over only the needed columns. I'll show you both ways. First, I'll use the `columns` attribute to obtain all the columns. I turn this into a list to make it print more cleanly. I am doing this in preparation for dropping most of them. I am being lazy - I just want to copy and paste the output into the drop method.
<p>
Note in the drop method I am using `axis=1` to say I am dropping columns and not rows (`axis=0`). 

</div>
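
To make the `axis` argument concrete, here is a small sketch on a made-up table (the rows and the names `toy`, `no_age`, `no_first` are assumptions for illustration). It also shows the `columns=` keyword, which I believe is an equivalent way to say the same thing as `axis=1`.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Age': [22, 38, 26],
                    'Survived': [0, 1, 1]})

no_age = toy.drop(['Age'], axis=1)  # axis=1: 'Age' is looked up among the column names
no_first = toy.drop([0], axis=0)    # axis=0: 0 is looked up among the row labels

same_as_no_age = toy.drop(columns=['Age'])  # keyword form of axis=1

print(list(no_age.columns))
print(list(no_first.index))
```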

In [0]:
list(titanic_table.columns)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [0]:
name_table_1 = titanic_table.drop(['PassengerId',
                                   'Pclass',
                                   'Sex',
                                   'Age',
                                   'SibSp',
                                   'Parch',
                                   'Ticket',
                                   'Fare',
                                   'Cabin',
                                   'Embarked'], axis=1)

In [0]:
name_table_1.head()

Unnamed: 0,Survived,Name
0,0,"Braund, Mr. Owen Harris"
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,1,"Heikkinen, Miss. Laina"
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,"Allen, Mr. William Henry"




Most pandas operations return a new table rather than changing the original in place. This is true above: the drop method gives me a new table and leaves `titanic_table` untouched. Normally I would just reassign the new table to `titanic_table`. This avoids keeping a lot of variables around like `titanic_table_1`, `titanic_table_2`, etc. I find trying to manage such a name space clumsy. It is true my way does not allow you to roll back to a prior version of the table. But you can "roll forward" by just restarting the kernel and executing all of the cells from the top of the notebook to get to a specific state.
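
A small sketch of that behavior, on a made-up table (the names `toy` and `trimmed` are illustrative assumptions): the original keeps its columns after `drop`, and reassigning to the same name is what actually replaces it.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 38]})  # made-up rows

trimmed = toy.drop(['Age'], axis=1)
columns_after_drop = list(toy.columns)  # the original still has both columns
print(columns_after_drop)
print(list(trimmed.columns))            # the new table has only 'Name'

# Reassigning to the same name keeps the name space small.
toy = toy.drop(['Age'], axis=1)
print(list(toy.columns))
```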

All that said, I am using a new var name above to demonstrate something. That comes next. Instead of dropping a bunch of columns, let's just copy over the 2 we want. Nice.


In [0]:
name_table_2 = titanic_table[['Name', 'Survived']].copy()  #copy so that adding a column later does not trigger a SettingWithCopyWarning

In [0]:
name_table_2.head()

Unnamed: 0,Name,Survived
0,"Braund, Mr. Owen Harris",0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",0
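
A brief sketch of how the brackets behave, using a made-up table (names `toy`, `two_cols`, `one_col` are assumptions for illustration). A list of column names gives back a table, in the order the list is written; a single name gives back one column.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 38], 'Survived': [0, 1]})

two_cols = toy[['Name', 'Survived']]  # list of names -> DataFrame
one_col = toy['Name']                 # single name   -> Series

print(type(two_cols).__name__, list(two_cols.columns))
print(type(one_col).__name__)
```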


<hr>
<h2>That's what I'm talkin about</h2>
<p>
<div class=h1_cell>
<p>
We trimmed down to the two columns we need. But as a warm up for word-vectorization in later modules, I am going to add a new column that is based on the Name column.
</div>

In [0]:
#I'm going to reuse the titanic_table var name to avoid proliferating names. If you need to get the full table back, redo the steps at the top of the notebook.

titanic_table = name_table_2  #or name_table_1 - they are equivalent apart from column order

<h2>Numerology</h2>

I have a theory that the length of your full name gives a clue to your future. I'm going to add a new column, `Length`, so I can test this out a little later. You can see below that pandas makes this pretty easy to do.

On the right-hand side of the assignment, pandas `apply` (with `axis=1`) visits every row in turn and passes that row to my lambda expression. The value returned by that lambda expression goes into the new column `Length`. If you like list comprehensions better, you can use this:
<pre>
titanic_table['Length'] = [len(row['Name']) for index,row in titanic_table.iterrows()]
</pre>

The iterrows method gives you the same functionality but also includes the row index (which we are not using).
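
As a sanity check, here is a sketch showing that the two approaches give the same lengths on a two-row table. I have also added a third option, the `.str.len()` string method, which is not used in this notebook but does the same job without a lambda. The names `toy`, `by_apply`, `by_loop`, and `by_str` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina']})

by_apply = toy.apply(lambda row: len(row['Name']), axis=1)      # row-by-row apply
by_loop = [len(row['Name']) for index, row in toy.iterrows()]   # list comprehension
by_str = toy['Name'].str.len()                                  # string method on the column

print(list(by_apply), by_loop, list(by_str))
```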


In [0]:
titanic_table['Length'] = titanic_table.apply(lambda row: len(row['Name']), axis=1)
titanic_table.head()

Unnamed: 0,Name,Survived,Length
0,"Braund, Mr. Owen Harris",0,23
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,51
2,"Heikkinen, Miss. Laina",1,22
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,44
4,"Allen, Mr. William Henry",0,24


<div class=just_text>
If you squint, you can almost believe that those who perished had shorter names.
</div>
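
Rather than squinting, one way to check is to average `Length` within each `Survived` group. This is only a sketch on the five rows that `head()` displayed, which is far too few to conclude anything; the name `mean_length` is illustrative.

```python
import pandas as pd

# The Survived and Length values from the five rows shown by head().
toy = pd.DataFrame({'Survived': [0, 1, 1, 1, 0],
                    'Length':   [23, 51, 22, 44, 24]})

# Average name length for those who perished (0) and those who survived (1).
mean_length = toy.groupby('Survived')['Length'].mean()
print(mean_length)
```

The same line applied to the full `titanic_table` would give the averages over all 891 passengers.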

<hr>
<h2>Write the table out</h2>

Let's save the work we have done with the table. Because I am using Google Colab, I have to authenticate myself before I can store the file. Note that I created a folder, `class_tables`, on My Drive on Google Drive. You can make up your own folder name if you wish.
<p>
  The first time you run this, you will be given a key to fill in and a website to visit. The website gives you the key. Copy it and type it in and hit enter.
</div>

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
#Hand the path straight to to_csv so that the encoding argument is actually applied.
titanic_table.to_csv('/content/gdrive/My Drive/class_tables/name_table.csv', encoding='utf-8', index=False)
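
If you want to convince yourself that a table survives the trip to csv and back, here is a minimal round-trip sketch. A temporary folder stands in for the Drive folder (an assumption so the example runs anywhere), and `toy` and `restored` are illustrative names.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris'],
                    'Survived': [0],
                    'Length': [23]})

# A temporary folder stands in for the Drive folder here.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'name_table.csv')
    toy.to_csv(path, index=False)  # index=False: do not write the row labels
    restored = pd.read_csv(path)   # read it back before the folder is removed

print(list(restored.columns))
```

Without `index=False`, the row labels are written as an extra first column with no header, and that column comes back as `Unnamed: 0` when the file is read in again.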

<hr>
<h1>
Use K-NN for classification
</h1>
<div class=h1_cell>
<p>
I said I was interested in predicting the Survived value for any passenger. This is a machine learning problem.  There are many machine-learning methods I might employ. But I am not ready to get into a comparison or survey at this point. I am just going to choose K Nearest Neighbor (K-NN) because it will get us going the fastest. It has its issues, but it is straightforward to build. And guess what, I am going to ask you to build it. It will look nice on your resume: "I built K-NN from scratch."
<p>
I'll meet you over on the assignment notebook. I'll ask you to add more columns (features) to our table and then we can get K-NN built and then see if Numerology is legit. [spoiler alert] You might be surprised.
</div>