## Data Cleaning, Describing, and Visualization

### Step 1 - Get your environment set up

1. Install Git on your computer and fork the class repository on [GitHub](https://github.com/tfolkman/byu_econ_applied_machine_learning).
2. Install [Anaconda](https://conda.io/docs/install/quick.html) and get it working.

### Step 2 - Explore Datasets

The goals of this project are:

1. Read in data from multiple sources
2. Gain practice cleaning, describing, and visualizing data

To this end, you need to find data sets from three different sources, for example: CSV, JSON, an API, SQL, or web scraping. For each of these data sets, you must perform the following:

1. Data cleaning. Some options you might consider: handle missing data, handle outliers, scale the data, convert some data to categorical.
2. Describe data. Provide tables, statistics, and summaries of your data.
3. Visualize data. Provide visualizations of your data.
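
As a rough illustration of these three steps, here is a minimal sketch on a tiny made-up table. The column names and values are invented for this example, and the plotting call is left commented out because it assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical toy data with one missing value and one obvious outlier.
raw = pd.DataFrame({
    'height_cm': [170.0, 165.0, None, 180.0, 999.0],
    'group': ['a', 'b', 'a', 'b', 'a'],
})

# 1. Clean: drop implausible values, fill missing values with the median,
#    and convert the string column to a categorical.
clean = raw[raw['height_cm'].isna() | (raw['height_cm'] < 250)].copy()
clean['height_cm'] = clean['height_cm'].fillna(clean['height_cm'].median())
clean['group'] = clean['group'].astype('category')

# 2. Describe: summary statistics overall and per group.
summary = clean['height_cm'].describe()
by_group = clean.groupby('group', observed=True)['height_cm'].mean()

# 3. Visualize (requires matplotlib):
# clean['height_cm'].plot(kind='hist')
```

Which cleaning choices make sense (dropping versus capping outliers, median versus mean imputation) depends on the data set; the ones above are only placeholders.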

These are the typical first steps of any data science project and are often the most time-consuming. My hope is that by going through this process three different times, you will gain a sense for it.

Also, as you are doing this, please tell us a story. Explain in your notebook why you are doing what you are doing and to what end. Telling a story in your analysis is a crucial skill for data scientists. There is an almost infinite number of ways to analyze a data set; help us understand why you chose your particular path and why we should care.

This homework is also very open-ended, and we have given you essentially no starting point. I realize this increases the difficulty and complexity, but I think it is worth it. It is much closer to what you might experience in industry and allows you to find data that might excite you!

In [146]:
from bs4 import BeautifulSoup
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(style='ticks', palette='Paired')
import statsmodels.api as sm
import re
import requests

# NCAA 2017 Wrestling Championships

[Flowrestling 2017 NCAA Results](https://www.flowrestling.org/results/5997906-2017-ncaa-championship-results/4209)

## Get and Clean the Data

We first need to pull the data from Flowrestling.

In [148]:
url = requests.get('https://www.flowrestling.org/results/5997906-2017-ncaa-championship-results/4209')
text = url.text
soup = BeautifulSoup(text,'lxml')

We then get each of the matches from the table, storing the info in a list. This information includes the weight, the winning wrestler and his school, the type of victory, the losing wrestler and his school, and the final result.

In [149]:
# One compiled pattern: weight, winner (school), victory type, loser (school), result.
pattern = re.compile(r'(\d{3})[\s\S]+?([^\(]+)\(([^\)]+)\) ([A-Z]+) ([^\(]+)\(([^\)]+)\), ([\s\S]+)')

matches = []
for match in soup.find_all(['a', 'br']):
    # A match line is either the tag's own text or the text node that follows it.
    candidates = [match.text, str(match.next_sibling)]
    if not any(re.match(r'\d{3}', c) for c in candidates):
        continue
    found = next((m for m in map(pattern.search, candidates) if m), None)
    if found is None:
        continue  # starts with a weight but does not fit the full pattern
    weight, w1, s1, dec, w2, s2, result = (g.strip() for g in found.groups())

    # Two rows per match: the winner (1) and the loser (0).
    matches.append([weight, w1, s1, dec, result, 1])
    matches.append([weight, w2, s2, dec, result, 0])
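
To make the parsing step easier to follow, here is a small sketch that applies a similar pattern to a single line. The sample line is a guess at the page's format, reconstructed from the first rows of the resulting table, and the named groups and the `parse_match` helper are hypothetical additions for readability.

```python
import re

# Hypothetical variant of the pattern with named groups.
MATCH_RE = re.compile(
    r'(?P<weight>\d{3})\s+'
    r'(?P<winner>[^(]+)\((?P<winner_school>[^)]+)\)\s+'
    r'(?P<victory>[A-Z]+)\s+'
    r'(?P<loser>[^(]+)\((?P<loser_school>[^)]+)\),\s+'
    r'(?P<result>.+)'
)

def parse_match(line):
    """Return a dict of stripped fields, or None if the line does not match."""
    found = MATCH_RE.search(line)
    if found is None:
        return None
    return {key: value.strip() for key, value in found.groupdict().items()}

# Assumed line format; the real page may differ.
sample = '197 J`den Cox (Missouri) DEC Brett Pfarr (Minnesota), 8-2'
parsed = parse_match(sample)
```

Named groups avoid having to remember which group number holds which field, at the cost of a longer pattern.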

Now we use pandas to get a first look at the data.

In [150]:
names = ['weight', 'name', 'school', 'victory', 'result', 'winner']
data = pd.DataFrame(matches, columns=names)
data.head()

|  | weight | name | school | victory | result | winner |
|---|---|---|---|---|---|---|
| 0 | 197 | J`den Cox | Missouri | DEC | 8-2 | 1 |
| 1 | 197 | Brett Pfarr | Minnesota | DEC | 8-2 | 0 |
| 2 | 285 | Kyle Snyder | Ohio St. | DEC | 6-3 | 1 |
| 3 | 285 | Connor Medbery | Wisconsin | DEC | 6-3 | 0 |
| 4 | 125 | Darian Cruz | Lehigh | DEC | 6-3 | 1 |


We need a binary indicator variable for each type of victory, so we first look at the values the variable takes and then define the indicators.

In [151]:
data.victory.value_counts()

DEC    438
MD     106
F       92
TF      34
Name: victory, dtype: int64

In [152]:
data['DEC'] = (data.victory == 'DEC').astype('int')
data['MD'] = (data.victory == 'MD').astype('int')
data['F'] = (data.victory == 'F').astype('int')
data['TF'] = (data.victory == 'TF').astype('int')
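
As an alternative sketch, `pd.get_dummies` can build all of the indicator columns in one call. The toy frame below is made up for illustration; only the `victory` column name and its four codes come from the data above.

```python
import pandas as pd

# Hypothetical toy frame with the four victory codes seen above.
toy = pd.DataFrame({'victory': ['DEC', 'MD', 'F', 'TF', 'DEC']})

# One 0/1 column per distinct value, named after the value itself.
dummies = pd.get_dummies(toy['victory']).astype('int')
toy = toy.join(dummies)
```

The explicit comparisons used in the notebook are arguably clearer when there are only four categories; `get_dummies` pays off when there are many.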

Now we get each wrestler's score for matches that were not decided by a pin: the first number in the result for the winner and the second for the loser.

In [153]:
data['score'] = ''
# Assign through a single .loc call to avoid pandas' chained-assignment warning.
data.loc[data['winner'] == 1, 'score'] = data.result.str.extract(r'(\d+)-', expand=False)
data.loc[data['winner'] == 0, 'score'] = data.result.str.extract(r'-(\d+)', expand=False)
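
Assigning through a single `.loc[rows, column]` call is the usual way to avoid pandas' chained-assignment warning. A minimal sketch on made-up rows (the values are invented; the column names mirror the ones used here):

```python
import pandas as pd

# Hypothetical rows: one decision (8-2) and one pin (5:24), winner then loser.
toy = pd.DataFrame({
    'result': ['8-2', '8-2', '5:24', '5:24'],
    'winner': [1, 0, 1, 0],
})

winner_pts = toy['result'].str.extract(r'(\d+)-', expand=False)
loser_pts = toy['result'].str.extract(r'-(\d+)', expand=False)

toy['score'] = ''
is_winner = toy['winner'] == 1
# Row selection and column selection happen in one indexing operation,
# so pandas writes into `toy` itself rather than into a temporary copy.
toy.loc[is_winner, 'score'] = winner_pts
toy.loc[~is_winner, 'score'] = loser_pts
```

Rows whose result is a time rather than a score do not match either pattern, so their `score` ends up missing.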


A quick sanity check to ensure we did not pick up a score for matches that ended in a pin.

In [155]:
data.loc[data['victory'] == 'F'].head()

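|  | weight | name | school | victory | result | winner | DEC | MD | F | TF | score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 165 | Vincenzo Joseph | Penn St. | F | 5:24 | 1 | 0 | 0 | 1 | 0 |  |
| 15 | 165 | Isaiah Martinez | Illinois | F | 5:24 | 0 | 0 | 0 | 1 | 0 |  |
| 22 | 141 | Kevin Jack | North Carolina St. | F | 6:22 | 1 | 0 | 0 | 1 | 0 |  |
| 23 | 141 | Bryce Meredith | Wyoming | F | 6:22 | 0 | 0 | 0 | 1 | 0 |  |
| 30 | 184 | Tj Dudley | Nebraska | F | 2:40 | 1 | 0 | 0 | 1 | 0 |  |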


Next, we convert the fall time to seconds for matches that ended in a pin and verify that the conversion worked.

In [166]:
pinned = data['F'] == 1
# Split "m:ss" into minutes and seconds; expand=False keeps each result a Series.
minutes = data.result.str.extract(r'(\d+):', expand=False).astype('float')
seconds = data.result.str.extract(r':(\d+)', expand=False).astype('float')
data['times'] = ''
data.loc[pinned, 'times'] = minutes * 60 + seconds
data.loc[pinned, ['times', 'result']].head()

|  | times | result |
|---|---|---|
| 14 | 324 | 5:24 |
| 15 | 324 | 5:24 |
| 22 | 382 | 6:22 |
| 23 | 382 | 6:22 |
| 30 | 160 | 2:40 |
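
The conversion can be checked in isolation. The sketch below uses the three fall times shown in the table plus one ordinary score; the named groups are a hypothetical convenience.

```python
import pandas as pd

results = pd.Series(['5:24', '6:22', '2:40', '8-2'])

# Two named groups give a DataFrame with 'minutes' and 'seconds' columns;
# rows that are not in m:ss form (such as '8-2') become NaN.
parts = results.str.extract(r'(?P<minutes>\d+):(?P<seconds>\d+)').astype('float')
elapsed = parts['minutes'] * 60 + parts['seconds']
```

For example, 5:24 is 5 × 60 + 24 = 324 seconds, which agrees with the table.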


We also need to mark who the champions are, since we want to evaluate what the best wrestlers are doing.

In [183]:
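data['champ'] = 0
# The winners of the first ten matches listed (even rows 0-18) are flagged as champions.
data.loc[[0,2,4,6,8,10,12,14,16,18],'champ'] = 1
# Propagate the flag to every row belonging to the same wrestler.
data['champ'] = data.groupby(['name'])['champ'].transform('max')
data.head()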

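|  | weight | name | school | victory | result | winner | DEC | MD | F | TF | score | times | champ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197 | J`den Cox | Missouri | DEC | 8-2 | 1 | 1 | 0 | 0 | 0 | 8 |  | 1 |
| 1 | 197 | Brett Pfarr | Minnesota | DEC | 8-2 | 0 | 1 | 0 | 0 | 0 | 2 |  | 0 |
| 2 | 285 | Kyle Snyder | Ohio St. | DEC | 6-3 | 1 | 1 | 0 | 0 | 0 | 6 |  | 1 |
| 3 | 285 | Connor Medbery | Wisconsin | DEC | 6-3 | 0 | 1 | 0 | 0 | 0 | 3 |  | 0 |
| 4 | 125 | Darian Cruz | Lehigh | DEC | 6-3 | 1 | 1 | 0 | 0 | 0 | 6 |  | 1 |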
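
The `groupby(...).transform('max')` step is what spreads the champion flag from one row to all of a wrestler's rows. A small sketch with invented names:

```python
import pandas as pd

# Hypothetical data: wrestler 'A' is flagged on only one of two rows.
toy = pd.DataFrame({
    'name': ['A', 'B', 'A', 'C', 'B'],
    'champ': [1, 0, 0, 0, 0],
})

# transform returns a result aligned with the original rows, so each row
# receives the maximum flag found anywhere in its name group.
toy['champ'] = toy.groupby('name')['champ'].transform('max')
```

This relies on names being unique to a wrestler; two different wrestlers with the same name would share a flag.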


In [3]:
data = pd.read_csv('Iris.csv')
data.head()

|  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
data.describe()

|  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


In [8]:
test = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data', header=None, sep=" ", quotechar='"')
test.head()

ParserError: Error tokenizing data. C error: Expected 32 fields in line 8, saw 33
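
One likely cause of this error is that `sep=" "` treats every single space as a delimiter, so a run of spaces produces many empty fields. A sketch of an alternative, assuming the file separates columns with runs of spaces and puts a tab before the quoted car name; the two sample lines are reconstructed from the rows shown further down, so their exact spacing is a guess.

```python
import io
import pandas as pd

# Stand-in for the downloaded file (spacing assumed, not copied from the source).
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"\n'
)

# sep=r'\s+' treats any run of spaces or tabs as one delimiter, while the
# default quotechar keeps the quoted car name together as a single field.
cars = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None)
```

With the real file, the same `sep` argument could be passed along with the URL in place of the `StringIO` object.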


In [6]:
test.columns

Int64Index([0], dtype='int64')

In [17]:
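import requests
import re

url = requests.get('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data')
text = url.text
# Replace runs of two or more spaces, and tabs, with commas so the file parses as CSV.
text = re.sub(r'  +', ',', text)
text = re.sub(r'\t', ',', text)
# Write the converted text to disk; the with-block closes the file automatically.
with open('test.csv', 'w') as j:
    j.write(text)
test = pd.read_csv('test.csv', header=None, quotechar='"')
test.head()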

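|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |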


In [2]:
import requests
import re
#url = requests.get('http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data')
url = requests.get('https://www.flowrestling.org/results/5997906-2017-ncaa-championship-results/4209')
print(url.text)
