https://github.com/hangtwenty/dive-into-machine-learning

###  1. A Visual Introduction to Machine Learning
http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

In machine learning, computers apply statistical learning techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions.  In machine learning terms, categorizing data points is a classification task. Dimensions in a data set are called features, predictors, or variables. 

One example of a machine learning method is a **decision tree**. Decision trees look at one variable at a time and are a reasonably accessible (though rudimentary) machine learning method. A decision tree uses if-then statements to define patterns in data. For example, if a home's elevation is above some number, then the home is probably in San Francisco.
At the best split, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split.
Additional forks will add new information that can increase a tree's prediction accuracy. 
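
To make "as homogeneous as possible" concrete, here is a minimal sketch of one common purity measure, Gini impurity, used to compare candidate thresholds. The elevation values, the city labels, and the helper names `gini` and `split_impurity` are invented for illustration and do not come from the article.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 two-class group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, threshold):
    """Weighted Gini impurity of the two branches made by `elevation > threshold`."""
    above = [city for elevation, city in rows if elevation > threshold]
    below = [city for elevation, city in rows if elevation <= threshold]
    n = len(rows)
    return len(above) / n * gini(above) + len(below) / n * gini(below)

# Hypothetical (elevation, city) pairs, invented for this sketch.
homes = [(5, "NY"), (10, "NY"), (12, "NY"), (30, "SF"), (60, "NY"), (75, "SF")]

# Try each observed elevation as a threshold and keep the purest split.
best_threshold = min((e for e, _ in homes), key=lambda t: split_impurity(homes, t))
print(best_threshold, round(split_impurity(homes, best_threshold), 3))
```

Gini impurity is only one of the "several mathematical methods" mentioned above; entropy-based measures follow the same pattern with a different formula inside `gini`.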

You could even continue to add branches until the tree's predictions are 100% accurate, so that at the end of every branch, the homes are purely in San Francisco or purely in New York.

These ultimate branches of the tree are called leaf nodes. Our decision tree models will classify the homes in each leaf node according to which class of homes is in the majority.
The newly-trained decision tree model determines whether a home is in San Francisco or New York by running each data point through the branches. Because we grew the tree until it was 100% accurate, this tree maps each training data point perfectly to which city it is in.

### Reality check
Of course, what matters more is how the tree performs on previously-unseen data. To test the tree's performance on new data, we need to apply it to data points that it has never seen before. This previously unused data is called test data.

The errors the tree makes on test data are due to overfitting. Our model has learned to treat every detail in the training data as important, even details that turned out to be irrelevant.
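
The gap between training and test accuracy can be reproduced in a few lines. This is only a sketch under stated assumptions: the data points are made up, and a nearest-neighbour lookup (`memorizer`) stands in for a fully grown tree because it also reproduces every training label exactly. It is not a decision tree. `simple_rule` is a hypothetical one-split model for comparison.

```python
def accuracy(predict, rows):
    """Fraction of (elevation, city) rows that `predict` labels correctly."""
    return sum(predict(e) == city for e, city in rows) / len(rows)

# Invented data: the homes at 20 and 40 go against the general trend.
train = [(5, "NY"), (8, "NY"), (20, "SF"), (40, "NY"), (55, "SF"), (70, "SF")]
test = [(6, "NY"), (18, "NY"), (42, "SF"), (65, "SF")]

def memorizer(elevation):
    """Copy the label of the closest training home (memorizes the training set)."""
    return min(train, key=lambda row: abs(row[0] - elevation))[1]

def simple_rule(elevation):
    """A single if-then split at a hypothetical threshold of 30."""
    return "SF" if elevation > 30 else "NY"

train_acc = accuracy(memorizer, train)   # perfect on data it has seen
test_acc = accuracy(memorizer, test)     # worse on unseen data
print(train_acc, test_acc, accuracy(simple_rule, test))
```

On this toy data the memorizing model treats the two unusual training homes as meaningful, which is exactly the kind of irrelevant detail that hurts it on the test set.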

### Recap
One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
Overfitting happens when some boundaries are based on distinctions that don't make a difference. You can see whether a model overfits by running test data through it.

###  2. "A Few Useful Things to Know about Machine Learning" 
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications.


### 3. Things in Pandas I Wish I'd Known Earlier
http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb


####  Loading Some Example Data


In [3]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python_reference/master/Data/some_soccer_data.csv')

In [4]:
df

Unnamed: 0,PLAYER,SALARY,GP,G,A,SOT,PPG,P
0,Sergio Agüero\n Forward — Manchester City,$19.2m,16.0,14,3.0,34,13.12,209.98
1,Eden Hazard\n Midfield — Chelsea,$18.9m,21.0,8,4.0,17,13.05,274.04
2,Alexis Sánchez\n Forward — Arsenal,$17.6m,,12,7.0,29,11.19,223.86
3,Yaya Touré\n Midfield — Manchester City,$16.6m,18.0,7,1.0,19,10.99,197.91
4,Ángel Di María\n Midfield — Manchester United,$15.0m,13.0,3,,13,10.17,132.23
5,Santiago Cazorla\n Midfield — Arsenal,$14.8m,20.0,4,,20,9.97,
6,David Silva\n Midfield — Manchester City,$14.3m,15.0,6,2.0,11,10.35,155.26
7,Cesc Fàbregas\n Midfield — Chelsea,$14.0m,20.0,2,14.0,10,10.47,209.49
8,Saido Berahino\n Forward — West Brom,$13.8m,21.0,9,0.0,20,7.02,147.43
9,Steven Gerrard\n Midfield — Liverpool,$13.8m,20.0,5,1.0,11,7.5,150.01


In [7]:
# Converting Column Names to Lowercase
df.columns = [c.lower() for c in df.columns]
df.tail(3)

Unnamed: 0,player,salary,gp,g,a,sot,ppg,p
7,Cesc Fàbregas\n Midfield — Chelsea,$14.0m,20,2,14,10,10.47,209.49
8,Saido Berahino\n Forward — West Brom,$13.8m,21,9,0,20,7.02,147.43
9,Steven Gerrard\n Midfield — Liverpool,$13.8m,20,5,1,11,7.5,150.01


In [8]:
#  Renaming Particular Columns
df = df.rename(columns={'p': 'points',
                        'gp': 'games',
                        'sot': 'shots_on_target',
                        'g': 'goals',
                        'pgp': 'points_per_game',  # typo for 'ppg': no such column, so rename() skips this key
                        'a': 'assists',
                       })
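
Note that the `rename` call above never renames `ppg`, because the mapping key is spelled `pgp` and `rename` ignores unknown keys by default. A small sketch of that behaviour, using a made-up two-column frame (`demo`) rather than the soccer data, and the `errors="raise"` option to surface such typos:

```python
import pandas as pd

demo = pd.DataFrame({"ppg": [13.12, 13.05], "p": [209.98, 274.04]})

# 'pgp' is not a column, so the default errors='ignore' skips it without complaint.
renamed = demo.rename(columns={"pgp": "points_per_game", "p": "points"})
print(list(renamed.columns))

# errors='raise' turns the silent miss into a KeyError.
try:
    demo.rename(columns={"pgp": "points_per_game"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```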

In [9]:
df.tail(3)

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points
7,Cesc Fàbregas\n Midfield — Chelsea,$14.0m,20,2,14,10,10.47,209.49
8,Saido Berahino\n Forward — West Brom,$13.8m,21,9,0,20,7.02,147.43
9,Steven Gerrard\n Midfield — Liverpool,$13.8m,20,5,1,11,7.5,150.01


In [10]:
# Changing Values in a Column
df['salary'] = df['salary'].apply(lambda x: x.strip('$m'))

In [11]:
df.tail()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points
5,Santiago Cazorla\n Midfield — Arsenal,14.8,20,4,,20,9.97,
6,David Silva\n Midfield — Manchester City,14.3,15,6,2.0,11,10.35,155.26
7,Cesc Fàbregas\n Midfield — Chelsea,14.0,20,2,14.0,10,10.47,209.49
8,Saido Berahino\n Forward — West Brom,13.8,21,9,0.0,20,7.02,147.43
9,Steven Gerrard\n Midfield — Liverpool,13.8,20,5,1.0,11,7.5,150.01
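
After stripping `$` and `m`, the salary values are still strings, which is why a later cell has to cast the column to float. A possible one-step alternative, sketched on a stand-alone Series with a few of the values shown above (the variable names are assumptions):

```python
import pandas as pd

salary = pd.Series(["$19.2m", "$18.9m", "$13.8m"])

# Vectorized equivalent of apply(lambda x: x.strip('$m')); the result is still text.
cleaned = salary.str.strip("$m")

# Convert the cleaned strings to floating-point numbers.
as_float = pd.to_numeric(cleaned)
print(as_float.dtype)
```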


In [12]:
# Adding a New Column
df['team'] = pd.Series('', index = df.index)
# or
df.insert(loc=8, column='position', value='')
df.tail()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
5,Santiago Cazorla\n Midfield — Arsenal,14.8,20,4,,20,9.97,,,
6,David Silva\n Midfield — Manchester City,14.3,15,6,2.0,11,10.35,155.26,,
7,Cesc Fàbregas\n Midfield — Chelsea,14.0,20,2,14.0,10,10.47,209.49,,
8,Saido Berahino\n Forward — West Brom,13.8,21,9,0.0,20,7.02,147.43,,
9,Steven Gerrard\n Midfield — Liverpool,13.8,20,5,1.0,11,7.5,150.01,,


In [13]:
# processing the 'player' column

def process_player_col(text):
    name, rest = text.split('\n')
    position, team = [x.strip() for x in rest.split(' — ')]
    return pd.Series([name, team, position])

In [14]:
df[['player', 'team', 'position']] = df.player.apply(process_player_col)

In [15]:
df.tail()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
5,Santiago Cazorla,14.8,20,4,,20,9.97,,Midfield,Arsenal
6,David Silva,14.3,15,6,2.0,11,10.35,155.26,Midfield,Manchester City
7,Cesc Fàbregas,14.0,20,2,14.0,10,10.47,209.49,Midfield,Chelsea
8,Saido Berahino,13.8,21,9,0.0,20,7.02,147.43,Forward,West Brom
9,Steven Gerrard,13.8,20,5,1.0,11,7.5,150.01,Midfield,Liverpool
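
The same parsing can probably be done without a row-wise `apply`, using the vectorized `.str.split(..., expand=True)` accessor. This is a sketch on two sample strings in the format shown earlier; the intermediate names `name_rest` and `pos_team` are hypothetical.

```python
import pandas as pd

player = pd.Series([
    "Eden Hazard\n Midfield — Chelsea",
    "Saido Berahino\n Forward — West Brom",
])

# Split once on the newline, then split the remainder on the dash separator.
name_rest = player.str.split("\n", n=1, expand=True)
pos_team = name_rest[1].str.split(" — ", n=1, expand=True)

parsed = pd.DataFrame({
    "player": name_rest[0].str.strip(),
    "position": pos_team[0].str.strip(),
    "team": pos_team[1].str.strip(),
})
print(parsed)
```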


In [16]:
# Applying Functions to Multiple Columns

cols = ['player', 'position', 'team']
df[cols] = df[cols].applymap(lambda x: x.lower())  # applymap is deprecated since pandas 2.1 in favor of DataFrame.map
df.head()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
0,sergio agüero,19.2,16.0,14,3.0,34,13.12,209.98,forward,manchester city
1,eden hazard,18.9,21.0,8,4.0,17,13.05,274.04,midfield,chelsea
2,alexis sánchez,17.6,,12,7.0,29,11.19,223.86,forward,arsenal
3,yaya touré,16.6,18.0,7,1.0,19,10.99,197.91,midfield,manchester city
4,Ángel di maría,15.0,13.0,3,,13,10.17,132.23,midfield,manchester united


In [22]:
# Counting Rows with NaNs

nans = df.shape[0] - df.dropna().shape[0]

In [23]:
nans # number of rows with missing values

3
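
An alternative that avoids building the intermediate `dropna()` frame is to count rows where any cell is null. The sketch below uses a small made-up frame (`demo`), not the soccer data, and also shows per-column counts.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "games": [16.0, np.nan, 13.0, 20.0],
    "assists": [3.0, 7.0, np.nan, np.nan],
    "points": [209.98, 223.86, 132.23, np.nan],
})

# Rows containing at least one NaN.
rows_with_nan = int(demo.isnull().any(axis=1).sum())

# Number of NaNs in each column.
nan_per_column = demo.isnull().sum()

# Same row count via the dropna() approach used above.
via_dropna = demo.shape[0] - demo.dropna().shape[0]
print(rows_with_nan, via_dropna)
```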

In [25]:
# Selecting NaN Rows
# Selecting all rows that have NaNs in the `assists` column
df[df['assists'].isnull()]

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
4,Ángel di maría,15.0,13,3,,13,10.17,132.23,midfield,manchester united
5,santiago cazorla,14.8,20,4,,20,9.97,,midfield,arsenal


In [27]:
# Selecting non-NaN Rows
df[~df['assists'].isnull()]
# or 
df[df['assists'].notnull()]

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
0,sergio agüero,19.2,16.0,14,3,34,13.12,209.98,forward,manchester city
1,eden hazard,18.9,21.0,8,4,17,13.05,274.04,midfield,chelsea
2,alexis sánchez,17.6,,12,7,29,11.19,223.86,forward,arsenal
3,yaya touré,16.6,18.0,7,1,19,10.99,197.91,midfield,manchester city
6,david silva,14.3,15.0,6,2,11,10.35,155.26,midfield,manchester city
7,cesc fàbregas,14.0,20.0,2,14,10,10.47,209.49,midfield,chelsea
8,saido berahino,13.8,21.0,9,0,20,7.02,147.43,forward,west brom
9,steven gerrard,13.8,20.0,5,1,11,7.5,150.01,midfield,liverpool


In [28]:
# Filling NaN cells with default value 0

df.fillna(value=0, inplace=True)
df

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
0,sergio agüero,19.2,16,14,3,34,13.12,209.98,forward,manchester city
1,eden hazard,18.9,21,8,4,17,13.05,274.04,midfield,chelsea
2,alexis sánchez,17.6,0,12,7,29,11.19,223.86,forward,arsenal
3,yaya touré,16.6,18,7,1,19,10.99,197.91,midfield,manchester city
4,Ángel di maría,15.0,13,3,0,13,10.17,132.23,midfield,manchester united
5,santiago cazorla,14.8,20,4,0,20,9.97,0.0,midfield,arsenal
6,david silva,14.3,15,6,2,11,10.35,155.26,midfield,manchester city
7,cesc fàbregas,14.0,20,2,14,10,10.47,209.49,midfield,chelsea
8,saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom
9,steven gerrard,13.8,20,5,1,11,7.5,150.01,midfield,liverpool


In [31]:
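# Adding an "empty" row to the DataFrame
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
import numpy as np
empty_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([df, empty_row], ignore_index=True)
df.tail(3)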

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
9,steven gerrard,13.8,20.0,5.0,1.0,11.0,7.5,150.01,midfield,liverpool
10,,,,,,,,,,
11,,,,,,,,,,


In [35]:
df = df[:11]  # the cell above was executed twice by mistake, so drop the extra empty row

In [37]:
df.tail()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
6,david silva,14.3,15.0,6.0,2.0,11.0,10.35,155.26,midfield,manchester city
7,cesc fàbregas,14.0,20.0,2.0,14.0,10.0,10.47,209.49,midfield,chelsea
8,saido berahino,13.8,21.0,9.0,0.0,20.0,7.02,147.43,forward,west brom
9,steven gerrard,13.8,20.0,5.0,1.0,11.0,7.5,150.01,midfield,liverpool
10,,,,,,,,,,


In [38]:
# Filling cells with data

df.loc[df.index[-1], 'player'] = 'new player'
df.loc[df.index[-1], 'salary'] = 12.3

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
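
The warning above is most likely triggered because `df = df[:11]` left `df` as a slice of the earlier frame, so pandas cannot tell whether the assignment should also affect the original. One common way to avoid it is an explicit `.copy()` after slicing. A minimal sketch with made-up data and hypothetical names (`original`, `trimmed`):

```python
import pandas as pd

original = pd.DataFrame({"player": ["a", "b", "c"], "salary": [1.0, 2.0, 3.0]})

# An explicit copy is an independent DataFrame, so assigning into it is unambiguous.
trimmed = original[:2].copy()
trimmed.loc[trimmed.index[-1], "salary"] = 12.3

print(trimmed)
print(original)  # unchanged
```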


In [39]:
df.tail()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
6,david silva,14.3,15.0,6.0,2.0,11.0,10.35,155.26,midfield,manchester city
7,cesc fàbregas,14.0,20.0,2.0,14.0,10.0,10.47,209.49,midfield,chelsea
8,saido berahino,13.8,21.0,9.0,0.0,20.0,7.02,147.43,forward,west brom
9,steven gerrard,13.8,20.0,5.0,1.0,11.0,7.5,150.01,midfield,liverpool
10,new player,12.3,,,,,,,,


In [40]:
# Sorting the DataFrame by a certain column (from highest to lowest)
df.sort_values('goals', ascending=False, inplace=True)  # DataFrame.sort was removed in pandas 0.20

  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  inplace=inplace, kind=kind, na_position=na_position)


In [41]:
df.head()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
0,sergio agüero,19.2,16,14,3,34,13.12,209.98,forward,manchester city
2,alexis sánchez,17.6,0,12,7,29,11.19,223.86,forward,arsenal
8,saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom
1,eden hazard,18.9,21,8,4,17,13.05,274.04,midfield,chelsea
3,yaya touré,16.6,18,7,1,19,10.99,197.91,midfield,manchester city


In [42]:
# Optional reindexing of the DataFrame after sorting
df.index = range(1, len(df.index)+1)
df.head()

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
1,sergio agüero,19.2,16,14,3,34,13.12,209.98,forward,manchester city
2,alexis sánchez,17.6,0,12,7,29,11.19,223.86,forward,arsenal
3,saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom
4,eden hazard,18.9,21,8,4,17,13.05,274.04,midfield,chelsea
5,yaya touré,16.6,18,7,1,19,10.99,197.91,midfield,manchester city
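
Sorting and renumbering can also be chained without `inplace=True`. This is a sketch on a three-row made-up frame; `ranked` is a hypothetical name.

```python
import pandas as pd

demo = pd.DataFrame({"player": ["a", "b", "c"], "goals": [7, 14, 9]})

# Sort from highest to lowest, then discard the old index labels.
ranked = demo.sort_values("goals", ascending=False).reset_index(drop=True)

# Shift to a 1-based index, as in the cell above.
ranked.index = ranked.index + 1
print(ranked)
```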


In [43]:
# Creating a dummy DataFrame with changes in the `salary` column
df_2 = df.copy()
df_2.loc[0:2, 'salary'] = [20.0, 15.0]
df_2.head(3)

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
1,sergio agüero,20.0,16,14,3,34,13.12,209.98,forward,manchester city
2,alexis sánchez,15.0,0,12,7,29,11.19,223.86,forward,arsenal
3,saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom


In [44]:
# Temporarily use the `player` column as the index so that
# update() can align the two DataFrames by player name

df.set_index('player', inplace=True)
df_2.set_index('player', inplace=True)
df.head(3)

player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
sergio agüero,19.2,16,14,3,34,13.12,209.98,forward,manchester city
alexis sánchez,17.6,0,12,7,29,11.19,223.86,forward,arsenal
saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom


In [45]:
# Update the `salary` column
df.update(other=df_2['salary'], overwrite=True)
df.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  mask, this, that, raise_on_error=True)


player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
sergio agüero,20.0,16,14,3,34,13.12,209.98,forward,manchester city
alexis sánchez,15.0,0,12,7,29,11.19,223.86,forward,arsenal
saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom
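
The reason for setting `player` as the index is that `update` aligns on index labels and only overwrites cells for which the other object has a non-NaN value. A small sketch of that alignment, with invented names (`current`, `changes`) and a named Series as the source of new values:

```python
import pandas as pd

current = pd.DataFrame(
    {"player": ["a", "b", "c"], "salary": [19.2, 17.6, 13.8]}
).set_index("player")

# Only players 'a' and 'b' get new salaries; the Series name picks the column.
changes = pd.Series({"a": 20.0, "b": 15.0}, name="salary")

# update() modifies `current` in place and leaves unmatched rows untouched.
current.update(changes)
print(current)
```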


In [46]:
# Reset the indices
df.reset_index(inplace=True)
df.head(3)

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
0,sergio agüero,20.0,16,14,3,34,13.12,209.98,forward,manchester city
1,alexis sánchez,15.0,0,12,7,29,11.19,223.86,forward,arsenal
2,saido berahino,13.8,21,9,0,20,7.02,147.43,forward,west brom


In [47]:
# Selecting only those players who play for either Arsenal or Chelsea
df[ (df['team'] == 'arsenal') | (df['team'] == 'chelsea')]

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
1,alexis sánchez,15.0,0,12,7,29,11.19,223.86,forward,arsenal
3,eden hazard,18.9,21,8,4,17,13.05,274.04,midfield,chelsea
7,santiago cazorla,14.8,20,4,0,20,9.97,0.0,midfield,arsenal
9,cesc fàbregas,14.0,20,2,14,10,10.47,209.49,midfield,chelsea
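
When the list of accepted values grows, chaining `|` comparisons gets unwieldy. `Series.isin` expresses the same filter more compactly. A sketch on a made-up frame (`demo`):

```python
import pandas as pd

demo = pd.DataFrame({
    "player": ["p1", "p2", "p3", "p4"],
    "team": ["arsenal", "chelsea", "liverpool", "arsenal"],
})

# Membership test against a list of teams.
either = demo[demo["team"].isin(["arsenal", "chelsea"])]

# The equivalent explicit boolean OR used above.
same = demo[(demo["team"] == "arsenal") | (demo["team"] == "chelsea")]
print(either)
```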


In [49]:
# Selecting forwards from Arsenal only
df[ (df['team'] == 'arsenal') & (df['position'] == 'forward')]

Unnamed: 0,player,salary,games,goals,assists,shots_on_target,ppg,points,position,team
1,alexis sánchez,15,0,12,7,29,11.19,223.86,forward,arsenal


In [52]:
df.columns.to_series().groupby(df.dtypes).groups

{dtype('float64'): ['games',
  'goals',
  'assists',
  'shots_on_target',
  'ppg',
  'points'],
 dtype('O'): ['player', 'salary', 'position', 'team']}

In [55]:
# select string columns
df.loc[:, (df.dtypes==np.dtype('O')).values].head()

Unnamed: 0,player,salary,position,team
0,sergio agüero,20.0,forward,manchester city
1,alexis sánchez,15.0,forward,arsenal
2,saido berahino,13.8,forward,west brom
3,eden hazard,18.9,midfield,chelsea
4,yaya touré,16.6,midfield,manchester city


In [59]:
(df.dtypes==np.dtype('O')).values

array([ True,  True, False, False, False, False, False, False,  True,  True], dtype=bool)

In [61]:
# so it's the same as this (applying boolean masks)
df.loc[:, [ True,  True, False, False, False, False, False, False,  True,  True]].head()

Unnamed: 0,player,salary,position,team
0,sergio agüero,20.0,forward,manchester city
1,alexis sánchez,15.0,forward,arsenal
2,saido berahino,13.8,forward,west brom
3,eden hazard,18.9,midfield,chelsea
4,yaya touré,16.6,midfield,manchester city
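
pandas also offers `DataFrame.select_dtypes` for this kind of column selection, which avoids building the boolean mask by hand. The sketch below uses a made-up frame and selects by "numeric" versus "not numeric"; on pandas versions that store strings as object dtype, `include=["object"]` would be the literal counterpart of the mask above.

```python
import pandas as pd

demo = pd.DataFrame({
    "player": ["sergio agüero", "eden hazard"],
    "goals": [14.0, 8.0],
    "team": ["manchester city", "chelsea"],
})

# Columns holding numbers.
numeric = demo.select_dtypes(include=["number"])

# Everything else (here: the text columns).
non_numeric = demo.select_dtypes(exclude=["number"])
print(list(numeric.columns), list(non_numeric.columns))
```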


In [62]:
df['salary'] = df['salary'].astype('float')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [63]:
types = df.columns.to_series().groupby(df.dtypes).groups

In [64]:
types

{dtype('float64'): ['salary',
  'games',
  'goals',
  'assists',
  'shots_on_target',
  'ppg',
  'points'],
 dtype('O'): ['player', 'position', 'team']}

I was recently asked how to do an if-test in pandas, that is, how to create an array of 1s and 0s depending on a condition, e.g., 0 if a value is less than 0.5 and 1 otherwise. With a boolean mask that is simple, because `True` and `False` convert to the integers 1 and 0. Note that the cells below use a different condition (`<= 0.05`), so there a 1 marks values at or below 0.05.

In [65]:
int(True)

1

In [66]:
a = [[2., .3, 4., 5.], [.8, .03, 0.02, 5.]]
df = pd.DataFrame(a)
df

Unnamed: 0,0,1,2,3
0,2.0,0.3,4.0,5
1,0.8,0.03,0.02,5


In [67]:
df = df<=0.05
df

Unnamed: 0,0,1,2,3
0,False,False,False,False
1,False,True,True,False


In [68]:
df.astype(int)

Unnamed: 0,0,1,2,3
0,0,0,0,0
1,0,1,1,0
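
For the condition as originally stated (0 if the value is below 0.5, otherwise 1), two equivalent sketches are shown here: casting a boolean mask, and `numpy.where`, which spells out both branches. The names `values`, `mask_as_int`, and `flags` are assumptions for this example.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame([[2.0, 0.3, 4.0, 5.0], [0.8, 0.03, 0.02, 5.0]])

# Boolean mask cast to int: True -> 1, False -> 0.
mask_as_int = (values >= 0.5).astype(int)

# np.where makes both outcomes explicit: 0 where value < 0.5, else 1.
flags = pd.DataFrame(
    np.where(values < 0.5, 0, 1), index=values.index, columns=values.columns
)
print(flags)
```

`np.where` is the more flexible of the two, since the two outcomes do not have to be 0 and 1.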
