In [101]:
import numpy as np
import pandas as pd
import matplotlib as mt

# Pandas Basics
## read_csv Function

In [102]:
table = pd.read_csv("./machine_learning_examples-master/linear_regression_class/data_2d.csv", header=None)
table

Unnamed: 0,0,1,2
0,17.930201,94.520592,320.259530
1,97.144697,69.593282,404.634472
2,81.775901,5.737648,181.485108
3,55.854342,70.325902,321.773638
4,49.366550,75.114040,322.465486
5,3.192702,29.256299,94.618811
6,49.200784,86.144439,356.348093
7,21.882804,46.841505,181.653769
8,79.509863,87.397356,423.557743
9,88.153887,65.205642,369.229245
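
The `header=None` argument above tells `read_csv` that the file has no header row, which is why the columns end up labelled 0, 1 and 2. The sketch below illustrates the difference on a small in-memory CSV; the `csv_text` data is hypothetical and only stands in for a header-less file such as `data_2d.csv`.

```python
import io
import pandas as pd

# Hypothetical header-less CSV text (a stand-in for a file like data_2d.csv).
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# header=None: every line is treated as data; columns are labelled 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Default behaviour: the first line is consumed as the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

print(no_header.shape)    # (2, 3)
print(with_header.shape)  # (1, 3)
```

Without `header=None`, the first data row would silently be lost to the column names.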


## Data Summary with Head/ Tail

In [103]:
table.head(5)

Unnamed: 0,0,1,2
0,17.930201,94.520592,320.25953
1,97.144697,69.593282,404.634472
2,81.775901,5.737648,181.485108
3,55.854342,70.325902,321.773638
4,49.36655,75.11404,322.465486


In [104]:
table.tail(5)

Unnamed: 0,0,1,2
95,46.456779,82.000171,336.876154
96,77.130301,95.188759,438.460586
97,68.600608,72.571181,355.900287
98,41.693887,69.241126,284.834637
99,4.142669,52.254726,168.034401


## Getting Acquainted with Data Frames

In [105]:
type(table)

pandas.core.frame.DataFrame

In [106]:
table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
0    100 non-null float64
1    100 non-null float64
2    100 non-null float64
dtypes: float64(3)
memory usage: 2.4 KB


## Selecting/ Subsetting Data From Data Frames

Data frame objects do not behave like arrays or matrices: typing *dataframe[0,0]* will not return the first element of the first row. One way around this is to convert the data frame into an array/matrix:

In [107]:
M = table.as_matrix()

  """Entry point for launching an IPython kernel.


**NOTE:** Remember that matrices and arrays are Numpy objects, whereas data frames are Pandas objects. Converting the original data frame with the *as_matrix* function turns it into a Numpy array, as can be seen below:

In [108]:
type(M)

numpy.ndarray

Elements are selected in fundamentally different ways in a Numpy array and a Pandas dataframe, so the same indexing expression selects different elements of the data table:
- In a **Numpy array**, **M[0]** gives the **first row** of the table, whereas
- In a **Pandas dataframe**, **table[0]** returns the values of **the column with the header name '0'**

See the examples below for an illustration of both cases:
### Select Elements from a Numpy array/ matrix:

Select the first row of a matrix in Numpy:

In [109]:
M[0]

array([ 17.93020121,  94.52059195, 320.2595296 ])

Select the element in the first row and first column of a matrix in Numpy:

In [110]:
M[0,0]

17.9302012052

### Select Elements from a Pandas dataframe:
An illustration of selecting from a Pandas dataframe:

In [111]:
table.head(5)

Unnamed: 0,0,1,2
0,17.930201,94.520592,320.25953
1,97.144697,69.593282,404.634472
2,81.775901,5.737648,181.485108
3,55.854342,70.325902,321.773638
4,49.36655,75.11404,322.465486


In [112]:
table[0].head(5)

0    17.930201
1    97.144697
2    81.775901
3    55.854342
4    49.366550
Name: 0, dtype: float64

Selecting specific rows and columns:
- **.iloc[]** selects elements from the data frame using **positional indexing**
- **.loc[]** selects elements from the data frame using **label-based indexing**

In [113]:
table.iloc[0]

0     17.930201
1     94.520592
2    320.259530
Name: 0, dtype: float64

In [114]:
table.iloc[0,2]

320.259529602

In [115]:
table.iloc[0:5,2]

0    320.259530
1    404.634472
2    181.485108
3    321.773638
4    322.465486
Name: 2, dtype: float64

In [116]:
type(table.iloc[0])

pandas.core.series.Series
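
The cells above only exercise `.iloc[]`. To contrast it with `.loc[]`, here is a minimal sketch on a small hypothetical frame (the column name `a` and the values are made up for illustration). With the default integer index the two look alike, but slices behave differently, and the difference becomes obvious once the row order no longer matches the labels.

```python
import pandas as pd

# Hypothetical frame with the default index labels 0, 1, 2.
df = pd.DataFrame({"a": [10, 20, 30]})

# Positional slices exclude the end position; label slices include the end label.
pos_slice = df.iloc[0:2]   # rows at positions 0 and 1
lab_slice = df.loc[0:2]    # rows labelled 0, 1 and 2

# Reorder the rows so that labels and positions disagree.
shuffled = df.loc[[2, 0, 1]]
first_by_position = shuffled.iloc[0, 0]   # first row as stored -> 30
first_by_label = shuffled.loc[0, "a"]     # row whose label is 0 -> 10

print(len(pos_slice), len(lab_slice))
print(first_by_position, first_by_label)
```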

In [117]:
table[[1,2]].head(5)

Unnamed: 0,1,2
0,94.520592,320.25953
1,69.593282,404.634472
2,5.737648,181.485108
3,70.325902,321.773638
4,75.11404,322.465486


In [118]:
type([1,2])

list

**Note:** In the above example, the inner brackets make [1,2] a 'list' of column labels; indexing a data frame with a list selects several columns at once and returns a data frame
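
As a small sketch of the point in the note, the return type depends on whether a single label or a list of labels is passed. The frame below is hypothetical and simply has columns labelled 0, 1 and 2 like `table`.

```python
import pandas as pd

# Hypothetical frame with integer column labels 0, 1, 2.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

single = df[1]         # one label            -> Series
wrapped = df[[1]]      # list with one label  -> DataFrame with one column
several = df[[1, 2]]   # list of two labels   -> DataFrame with two columns

print(type(single).__name__, type(wrapped).__name__, several.shape)
```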

Using a conditional statement to select components of a data frame:

In [119]:
table[table[0] < 5]

Unnamed: 0,0,1,2
5,3.192702,29.256299,94.618811
44,3.593966,96.252217,293.237183
54,4.593463,46.335932,145.818745
90,1.382983,84.944087,252.905653
99,4.142669,52.254726,168.034401


The statement **table[table[0] < 5]** basically translates to "From the 'table' dataframe, select all rows whose column '0' has a value of less than 5"

Evaluating just **table[0] < 5** returns a boolean Series (True/False for each row)

In [120]:
table[0] < 5

0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
      ...  
70    False
71    False
72    False
73    False
74    False
75    False
76    False
77    False
78    False
79    False
80    False
81    False
82    False
83    False
84    False
85    False
86    False
87    False
88    False
89    False
90     True
91    False
92    False
93    False
94    False
95    False
96    False
97    False
98    False
99     True
Name: 0, Length: 100, dtype: bool
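
Because the comparison yields a boolean Series, several conditions can be combined before filtering. The following is a sketch on hypothetical data; note that pandas uses `&` and `|` (with parentheses around each comparison) rather than the Python keywords `and` and `or`.

```python
import pandas as pd

# Hypothetical two-column frame with integer column labels, like `table`.
df = pd.DataFrame({0: [3.0, 50.0, 4.5, 80.0], 1: [10.0, 20.0, 90.0, 5.0]})

# Each comparison is a boolean Series; & combines them element-wise.
mask = (df[0] < 5) & (df[1] > 50)

subset = df[mask]            # keep only the rows where the mask is True
matches = int(mask.sum())    # True counts as 1, so the sum is the match count

print(matches)
print(subset)
```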

In [121]:
table[table.iloc[0:5] < 5].head(5)

Unnamed: 0,0,1,2
0,,,
1,,,
2,,,
3,,,
4,,,
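
The all-NaN output above is expected: when the mask is itself a data frame rather than a Series, pandas keeps the shape of the table and blanks out every cell where the mask is False, and none of the values in the first five rows is below 5. A minimal sketch on hypothetical data, using a mask with the same shape as the frame:

```python
import pandas as pd

# Hypothetical 2x2 frame.
df = pd.DataFrame([[1.0, 9.0], [7.0, 2.0]])

frame_mask = df < 5           # boolean DataFrame, same shape as df
masked = df[frame_mask]       # shape is kept; cells where the mask is False become NaN
same = df.where(frame_mask)   # explicit spelling of the same operation

rows = df[df[0] < 5]          # a Series mask, by contrast, drops whole rows

print(masked)
print(rows)
```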


## Column Names in Pandas Dataframes

The dataset used in this exercise contains the following irrelevant information:
- A 3-row footer summary that does not comply with tidy data concepts

The skipfooter argument is not supported by pandas' default C-based parser engine, so passing the **engine="python"** argument is necessary to tell pandas to parse the file with the Python engine

In [122]:
airline = pd.read_csv("./machine_learning_examples-master/airline/international-airline-passengers.csv", engine="python", 
                      skipfooter=3)
airline.head(5)

Unnamed: 0,Month,International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121
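
The effect of `skipfooter` can be reproduced on a small in-memory file. This is only a sketch: the CSV text, its column names and its two footer lines are hypothetical and not taken from the airline dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text: a header, two data rows, then two footer lines.
csv_text = (
    "Month,Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "Source: hypothetical footer\n"
    "Second footer line\n"
)

# Read as-is: the footer lines show up as extra, mostly empty rows.
raw = pd.read_csv(io.StringIO(csv_text))

# skipfooter drops the last N lines; it needs the Python parser engine.
trimmed = pd.read_csv(io.StringIO(csv_text), skipfooter=2, engine="python")

print(len(raw), len(trimmed))
```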


## Checking and Assigning Column Names

In [123]:
airline.columns

Index(['Month', 'International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60'], dtype='object')

In [124]:
airline.columns = ["month", "passengers"]
airline.columns

Index(['month', 'passengers'], dtype='object')
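
Assigning to `.columns` replaces every name at once and relies on getting the order right. As an alternative sketch, `rename` maps old names to new ones explicitly and returns a new frame; the frame and its long column name below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with an unwieldy column name.
df = pd.DataFrame({"Month": ["1949-01"], "A very long passenger column name": [112]})

# Map old names to new names; columns not mentioned are left unchanged.
renamed = df.rename(columns={
    "Month": "month",
    "A very long passenger column name": "passengers",
})

print(list(renamed.columns))
```

The original `df` keeps its old names unless the result is assigned back.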

In [125]:
airline.head(5)

Unnamed: 0,month,passengers
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121


### Adding New Columns to a Dataframe

In [126]:
airline['ones'] = 1
airline.head(5)

Unnamed: 0,month,passengers,ones
0,1949-01,112,1
1,1949-02,118,1
2,1949-03,132,1
3,1949-04,129,1
4,1949-05,121,1


### The apply() Function

The apply() function is used to apply a single function repeatedly to the rows or columns of a table. Its basic arguments are structured in the following manner:

**tablename.apply(function_name, axis = 0 or 1)**

Pass the argument axis = 1 so that the function is applied to each row instead of each column. The function can be a named function or, equivalently, a 'lambda' (**a way to write a single-use function directly inside an argument**).

The same result can also be obtained with an explicit for loop over the rows - **do not actually do this. For loops take quite a while to run in Python**.
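
The equivalent forms mentioned above (a named function, a lambda and an explicit loop) might look like the sketch below. The column names `x1` and `x2`, the helper `get_interaction` and the data are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame with two numeric columns.
df = pd.DataFrame({"x1": [1, 2, 3], "x2": [10, 20, 30]})

# 1) A named function applied to each row (axis=1).
def get_interaction(row):
    return row["x1"] * row["x2"]

named = df.apply(get_interaction, axis=1)

# 2) The same thing with a single-use lambda.
with_lambda = df.apply(lambda row: row["x1"] * row["x2"], axis=1)

# 3) The explicit loop that apply() replaces (slow; shown for comparison only).
looped = []
for i in range(len(df)):
    row = df.iloc[i]
    looped.append(row["x1"] * row["x2"])

# For simple arithmetic, a plain column expression avoids apply() altogether.
vectorized = df["x1"] * df["x2"]

print(named.tolist())
```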

**Example case:**

In [127]:
from datetime import datetime

In [128]:
datetime.strptime("1949-05", "%Y-%m")

datetime.datetime(1949, 5, 1, 0, 0)

The following line applies the .strptime() function from the 'datetime' module to the ['month'] value of each row, parsing the "%Y-%m" formatted string into a datetime object; the axis = 1 argument makes apply() work row by row

In [129]:
airline['dt'] = airline.apply(lambda row: datetime.strptime(row['month'],"%Y-%m"), axis = 1)

In [130]:
airline.head(5)

Unnamed: 0,month,passengers,ones,dt
0,1949-01,112,1,1949-01-01
1,1949-02,118,1,1949-02-01
2,1949-03,132,1,1949-03-01
3,1949-04,129,1,1949-04-01
4,1949-05,121,1,1949-05-01
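
For date parsing specifically, pandas also offers `pd.to_datetime`, which converts a whole column in one call. The sketch below compares it with the row-wise `strptime` approach on a small hypothetical Series of month strings.

```python
from datetime import datetime

import pandas as pd

# Hypothetical month strings in the same "%Y-%m" layout as the 'month' column.
months = pd.Series(["1949-01", "1949-02", "1949-03"])

# Element-wise parsing, in the spirit of the apply() call above.
via_apply = months.apply(lambda s: datetime.strptime(s, "%Y-%m"))

# Column-wise parsing in a single call.
via_to_datetime = pd.to_datetime(months, format="%Y-%m")

print((via_apply == via_to_datetime).all())
print(via_to_datetime.dt.month.tolist())
```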


In [131]:
airline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 4 columns):
month         144 non-null object
passengers    144 non-null int64
ones          144 non-null int64
dt            144 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 4.6+ KB


### Joining a Column/ Other Dataframe to Another Dataframe

In [132]:
# Reading the two tables for this exercise:

table1 = pd.read_csv("./machine_learning_examples-master/numpy_class/table1.csv")
table2 = pd.read_csv("./machine_learning_examples-master/numpy_class/table2.csv")

In [133]:
table1.info()
table2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
user_id    3 non-null int64
email      3 non-null object
age        3 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
user_id    11 non-null int64
ad_id      11 non-null int64
click      11 non-null int64
dtypes: int64(3)
memory usage: 344.0 bytes


In [134]:
table1

Unnamed: 0,user_id,email,age
0,1,alice@gmail.com,20
1,2,bob@gmail.com,25
2,3,carol@gmail.com,30


In [135]:
table2

Unnamed: 0,user_id,ad_id,click
0,1,1,1
1,1,2,0
2,1,5,0
3,2,3,0
4,2,4,1
5,2,1,0
6,3,2,0
7,3,1,0
8,3,3,0
9,3,4,0


In [136]:
m1 = pd.merge(table1, table2, on = 'user_id')

In [137]:
m1

Unnamed: 0,user_id,email,age,ad_id,click
0,1,alice@gmail.com,20,1,1
1,1,alice@gmail.com,20,2,0
2,1,alice@gmail.com,20,5,0
3,2,bob@gmail.com,25,3,0
4,2,bob@gmail.com,25,4,1
5,2,bob@gmail.com,25,1,0
6,3,carol@gmail.com,30,2,0
7,3,carol@gmail.com,30,1,0
8,3,carol@gmail.com,30,3,0
9,3,carol@gmail.com,30,4,0
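
`pd.merge` performs an inner join by default, so only `user_id` values present in both tables survive. The sketch below shows how the `how` argument changes this; the `users` and `clicks` frames are hypothetical and deliberately include one user without clicks and one click without a user.

```python
import pandas as pd

# Hypothetical tables: user 3 has no clicks, user 4 has no user record.
users = pd.DataFrame({"user_id": [1, 2, 3], "age": [20, 25, 30]})
clicks = pd.DataFrame({"user_id": [1, 1, 2, 4], "click": [1, 0, 1, 0]})

inner = pd.merge(users, clicks, on="user_id")               # default: how="inner"
left = pd.merge(users, clicks, on="user_id", how="left")    # keep every user
outer = pd.merge(users, clicks, on="user_id", how="outer")  # keep everything

print(len(inner), len(left), len(outer))
```

Rows without a match on the other side are filled with NaN in the left and outer joins.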


In [138]:
m1.info

<bound method DataFrame.info of     user_id            email  age  ad_id  click
0         1  alice@gmail.com   20      1      1
1         1  alice@gmail.com   20      2      0
2         1  alice@gmail.com   20      5      0
3         2    bob@gmail.com   25      3      0
4         2    bob@gmail.com   25      4      1
5         2    bob@gmail.com   25      1      0
6         3  carol@gmail.com   30      2      0
7         3  carol@gmail.com   30      1      0
8         3  carol@gmail.com   30      3      0
9         3  carol@gmail.com   30      4      0
10        3  carol@gmail.com   30      5      1>

**Note:** `m1.info` without parentheses only returns the bound method, whose representation prints the whole frame as seen above; call `m1.info()` to get the summary.

In [139]:
m1.iloc[0:4]

Unnamed: 0,user_id,email,age,ad_id,click
0,1,alice@gmail.com,20,1,1
1,1,alice@gmail.com,20,2,0
2,1,alice@gmail.com,20,5,0
3,2,bob@gmail.com,25,3,0


## .values vs .as_matrix()

In [141]:
df = pd.DataFrame([[1,2],[3,4]])
df

Unnamed: 0,0,1
0,1,2
1,3,4


In [142]:
df.as_matrix()

  """Entry point for launching an IPython kernel.


array([[1, 2],
       [3, 4]], dtype=int64)

**NOTE:** note the warning message above: the method .as_matrix() is deprecated and will be removed in a future version of pandas (it was removed in pandas 1.0), and **.values** should be used instead

In [144]:
df.values

array([[1, 2],
       [3, 4]], dtype=int64)
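
Since pandas 0.24 there is also a `.to_numpy()` method, which the pandas documentation recommends over `.values` for getting a Numpy array. A minimal sketch on the same kind of 2x2 frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

arr = df.to_numpy()   # explicit conversion to a Numpy array

print(type(arr).__name__, arr.shape)
print(np.array_equal(arr, df.values))
```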

**NOTE:** A key message here is to not be afraid of changes. **Learn the principles, not the syntax** - changes happen all the time. One example is the structure of for loops in Python 2 and 3:
- Python 2:
    
    for i in xrange(N):
        print x[i]
- Python 3:
   
    for i in range(N):
        print(x[i])
The principle behind both snippets is the same: we want to iterate over the indices lazily instead of instantiating a full list
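
To make the principle concrete, the Python 3 sketch below compares `range` with an explicitly built list. The size `N` is an arbitrary illustrative value.

```python
import sys

N = 100_000  # arbitrary illustrative size

lazy = range(N)       # stores only start, stop and step
eager = list(lazy)    # materialises all N integers in memory

# The range object stays tiny no matter how large N is.
print(sys.getsizeof(lazy) < sys.getsizeof(eager))

# It still supports len() and indexing, and can be iterated more than once.
print(len(lazy), lazy[10])
```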