# Intro to Pandas

### Jan 25th, 2019

Topics covered:
* Series & DataFrames
* Basic Summary methods
* Selecting & Filtering Data
* Creating New Variables
* Groupby operations
* Merge operations
* Reading Files
* Class Exercises


In [2]:
import pandas as pd
import numpy as np

# !pip install jupyter_contrib_nbextensions

# Topics

## Series & DataFrames

In [2]:
sports = pd.Series(['football', 'basketball', 'volleyball', 'tennis'])

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [3]:
sports

0       football
1     basketball
2     volleyball
3         tennis
dtype: object

In [5]:
population

Germany           81.3
Belgium           11.3
France            64.3
United Kingdom    64.9
Netherlands       16.9
dtype: float64

In [6]:
countries

Unnamed: 0,country,population,area,capital
0,Belgium,11.3,30510,Brussels
1,France,64.3,671308,Paris
2,Germany,81.3,357050,Berlin
3,Netherlands,16.9,41526,Amsterdam
4,United Kingdom,64.9,244820,London


In [3]:
type(population)

pandas.core.series.Series

In [7]:
sports.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
population.index

Index(['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], dtype='object')

In [8]:
population['Belgium']

11.3

In [6]:
population.values

array([11.3, 64.3, 81.3, 16.9, 64.9])

In [7]:
population/100

Belgium           0.113
France            0.643
Germany           0.813
Netherlands       0.169
United Kingdom    0.649
dtype: float64

In [8]:
type(population.values)

numpy.ndarray

In [9]:
type(countries)

pandas.core.frame.DataFrame

In [10]:
countries

Unnamed: 0,area,capital,country,population
0,30510,Brussels,Belgium,11.3
1,671308,Paris,France,64.3
2,357050,Berlin,Germany,81.3
3,41526,Amsterdam,Netherlands,16.9
4,244820,London,United Kingdom,64.9


Accessing DataFrame columns as attributes with the `.` operator

In [4]:
type(countries.area)

pandas.core.series.Series

In [5]:
countries.area.values

array([ 30510, 671308, 357050,  41526, 244820])

In [6]:
type(countries.capital.values)

numpy.ndarray
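
Attribute access is a convenience that only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. A minimal sketch of both cases, using a made-up frame (the `count` and `land area` columns are hypothetical and chosen only to trigger the problem):

```python
import pandas as pd

# Hypothetical frame: "count" clashes with DataFrame.count, "land area" contains a space
frame = pd.DataFrame({"area": [30510, 671308],
                      "count": [1, 2],
                      "land area": [30278, 640427]})

same_column = frame.area.equals(frame["area"])   # both spellings return the same Series
is_method = callable(frame.count)                # the attribute is the method, not the column
count_column = frame["count"]                    # brackets always reach the column
land_area = frame["land area"]                   # names with spaces need brackets
```

Bracket access works for every column name, so it is the safer default in scripts.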

## Basic Methods

In [7]:
countries.columns

Index(['country', 'population', 'area', 'capital'], dtype='object')

In [8]:
countries.dtypes

country        object
population    float64
area            int64
capital        object
dtype: object

In [9]:
countries.head()

Unnamed: 0,country,population,area,capital
0,Belgium,11.3,30510,Brussels
1,France,64.3,671308,Paris
2,Germany,81.3,357050,Berlin
3,Netherlands,16.9,41526,Amsterdam
4,United Kingdom,64.9,244820,London


In [18]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
area          5 non-null int64
capital       5 non-null object
country       5 non-null object
population    5 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 240.0+ bytes


In [19]:
countries.values

array([[30510, 'Brussels', 'Belgium', 11.3],
       [671308, 'Paris', 'France', 64.3],
       [357050, 'Berlin', 'Germany', 81.3],
       [41526, 'Amsterdam', 'Netherlands', 16.9],
       [244820, 'London', 'United Kingdom', 64.9]], dtype=object)

In [20]:
countries.describe()

Unnamed: 0,area,population
count,5.0,5.0
mean,269042.8,47.74
std,264012.827994,31.519645
min,30510.0,11.3
25%,41526.0,16.9
50%,244820.0,64.3
75%,357050.0,64.9
max,671308.0,81.3


In [126]:
countries.capital.value_counts()

Berlin       1
Brussels     1
Amsterdam    1
Paris        1
London       1
Name: capital, dtype: int64

In [7]:
population

Germany           81.3
Belgium           11.3
France            64.3
United Kingdom    64.9
Netherlands       16.9
dtype: float64

In [14]:
population.reset_index()

Unnamed: 0,index,0
0,Germany,81.3
1,Belgium,11.3
2,France,64.3
3,United Kingdom,64.9
4,Netherlands,16.9


In [11]:
type(population.reset_index())

pandas.core.frame.DataFrame

In [6]:
countries.capital.value_counts().reset_index()

Unnamed: 0,index,capital
0,London,1
1,Amsterdam,1
2,Brussels,1
3,Berlin,1
4,Paris,1


## Selecting and Filtering Data

<div class="alert alert-warning">
<b>ATTENTION!</b>: <br><br>

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex than in numpy. <br><br> We now have to distinguish between:

 <ul>
  <li>selection by <b>label</b></li>
  <li>selection by <b>position</b></li>
</ul>
</div>

In [3]:
df = pd.read_csv("train.csv")

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### `df[]` provides some convenience shortcuts

Selecting a single column

In [49]:
df['Pclass']  # Can also use df.Pclass

0      3
1      1
2      3
3      1
4      3
5      3
6      1
7      3
8      3
9      2
10     3
11     1
12     3
13     3
14     3
15     2
16     3
17     2
18     3
19     3
20     2
21     2
22     3
23     1
24     3
25     3
26     3
27     1
28     3
29     3
      ..
861    2
862    1
863    3
864    2
865    2
866    2
867    1
868    3
869    3
870    3
871    1
872    1
873    3
874    2
875    3
876    3
877    3
878    3
879    1
880    2
881    3
882    3
883    2
884    3
885    3
886    2
887    1
888    3
889    1
890    3
Name: Pclass, Length: 891, dtype: int64

Selecting multiple columns

In [15]:
df[['Pclass','Sex']]

Unnamed: 0,Pclass,Sex
0,3,male
1,1,female
2,3,female
3,1,female
4,3,male
5,3,male
6,1,male
7,3,male
8,3,female
9,2,female


Keep in mind that selecting more than one column returns a DataFrame, not a Series, hence the difference in formatting between the two outputs above.
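
The distinction also applies when the list holds a single name. A small sketch on a toy frame (the three rows are invented stand-ins for the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})

as_series = toy["Pclass"]      # single label -> Series
as_frame = toy[["Pclass"]]     # list with one label -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```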





The same `[]` syntax with a slice selects rows by position:

In [16]:
df[3:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Systematic indexing with `loc` and `iloc`

When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by label
* `iloc`: selection by position

Both indexers accept a row indexer and, optionally, a column indexer:

* `df.loc[row_indexer, column_indexer]`
* `df.iloc[row_indexer, column_indexer]`
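
The difference between the two is easiest to see on a frame whose index is not `0..n-1`. A minimal sketch, assuming a toy frame indexed by country name:

```python
import pandas as pd

# Toy frame indexed by country name instead of 0..n-1
toy = pd.DataFrame({"population": [11.3, 64.3, 81.3, 16.9],
                    "area": [30510, 671308, 357050, 41526]},
                   index=["Belgium", "France", "Germany", "Netherlands"])

by_label = toy.loc["France", "area"]          # row label, column label
by_position = toy.iloc[1, 1]                  # second row, second column

label_slice = toy.loc["Belgium":"Germany"]    # label slices include the end label
position_slice = toy.iloc[0:2]                # position slices exclude the end position
```

Note the asymmetry in slicing: `loc` includes both endpoints, while `iloc` follows the usual Python convention.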

In [17]:
df.loc[4,'Fare']

8.0500000000000007

In [57]:
df.loc[df.Sex=='female']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0000,,S


In [59]:
df.loc[df.Sex=='female','Fare']

1       71.2833
2        7.9250
3       53.1000
8       11.1333
9       30.0708
10      16.7000
11      26.5500
14       7.8542
15      16.0000
18      18.0000
19       7.2250
22       8.0292
24      21.0750
25      31.3875
28       7.8792
31     146.5208
32       7.7500
38      18.0000
39      11.2417
40       9.4750
41      21.0000
43      41.5792
44       7.8792
47       7.7500
49      17.8000
52      76.7292
53      26.0000
56      10.5000
58      27.7500
61      80.0000
         ...   
807      7.7750
809     53.1000
813     31.2750
816      7.9250
820     93.5000
823     12.4750
829     80.0000
830     14.4542
835     83.1583
842     31.0000
849     89.1042
852     15.2458
853     39.4000
854     26.0000
855      9.3500
856    164.8667
858     19.2583
862     25.9292
863     69.5500
865     13.0000
866     13.8583
871     52.5542
874     24.0000
875      7.2250
879     83.1583
880     26.0000
882     10.5167
885     29.1250
887     30.0000
888     23.4500
Name: Fare, Length: 314, dtype: float64

In [62]:
df.loc[df.Sex=='female',['Fare','Name','Sex']]

Unnamed: 0,Fare,Name,Sex
1,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,7.9250,"Heikkinen, Miss. Laina",female
3,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
8,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female
9,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,16.7000,"Sandstrom, Miss. Marguerite Rut",female
11,26.5500,"Bonnell, Miss. Elizabeth",female
14,7.8542,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,16.0000,"Hewlett, Mrs. (Mary D Kingcome)",female
18,18.0000,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


In [64]:
df.loc[df.Sex=='female'][['Fare','Name','Sex']]

Unnamed: 0,Fare,Name,Sex
1,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,7.9250,"Heikkinen, Miss. Laina",female
3,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
8,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female
9,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,16.7000,"Sandstrom, Miss. Marguerite Rut",female
11,26.5500,"Bonnell, Miss. Elizabeth",female
14,7.8542,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,16.0000,"Hewlett, Mrs. (Mary D Kingcome)",female
18,18.0000,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


`iloc` selects by integer position, regardless of the index labels

In [65]:
df.iloc[4]

PassengerId                           5
Survived                              0
Pclass                                3
Name           Allen, Mr. William Henry
Sex                                male
Age                                  35
SibSp                                 0
Parch                                 0
Ticket                           373450
Fare                               8.05
Cabin                               NaN
Embarked                              S
Name: 4, dtype: object

In [19]:
df.iloc[5:7]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S


In [22]:
df.iloc[5:7,'Fare']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [23]:
df.iloc[5:7]['Fare']

5     8.4583
6    51.8625
Name: Fare, dtype: float64
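
The error above arises because `iloc` accepts positions only. One way to stay position-based in a single indexing step is to translate the column label into its position with `Index.get_loc`. A sketch on a toy frame (names and fares are stand-ins):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Moran", "McCarthy", "Palsson"],
                    "Fare": [8.4583, 51.8625, 21.075]})

fare_position = toy.columns.get_loc("Fare")    # integer position of the "Fare" column
fares = toy.iloc[0:2, fare_position]           # rows by position, column by position
```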

In [24]:
df.iloc[[1,2,3,4]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The different indexing methods can also be used to assign data:

In [76]:
df2 = df.copy()

df2.loc[0,'Fare'] = -100.0

In [77]:
df2.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,-100.0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
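
Assignment through `loc` also accepts a boolean mask as the row indexer, which is the usual way to overwrite values conditionally. A minimal sketch on a toy frame; the cap of 100 is an arbitrary choice for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Braund", "Cumings", "Fortune"],
                    "Fare": [7.25, 71.2833, 263.0]})

capped = toy.copy()                                  # leave the original untouched
capped.loc[capped["Fare"] > 100, "Fare"] = 100.0     # boolean row mask + column label
```

Doing this in one `loc` call matters: the chained form `capped[mask]["Fare"] = ...` may assign to a temporary copy and leave `capped` unchanged.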


Rows can be selected based on whether or not they satisfy a certain (boolean) condition

In [80]:
df[df.Fare>100]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
118,119,0,1,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
195,196,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C
215,216,1,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
268,269,1,1,"Graham, Mrs. William Thompson (Edith Junkins)",female,58.0,0,1,PC 17582,153.4625,C125,S
269,270,1,1,"Bissette, Miss. Amelia",female,35.0,0,0,PC 17760,135.6333,C99,S
297,298,0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S


## Creating New Variables

In [97]:
countries['newVar'] = [1,2,3,4,5]                   #Basic assignment
countries

Unnamed: 0,area,capital,country,population,newVar
0,30510,Brussels,Belgium,11.3,1
1,671308,Paris,France,64.3,2
2,357050,Berlin,Germany,81.3,3
3,41526,Amsterdam,Netherlands,16.9,4
4,244820,London,United Kingdom,64.9,5


In [99]:
countries['newVar'] = countries.population * 2  + countries.area**0.5   #Using existing columns
countries

Unnamed: 0,area,capital,country,population,newVar
0,30510,Brussels,Belgium,11.3,197.27112
1,671308,Paris,France,64.3,947.933876
2,357050,Berlin,Germany,81.3,760.13661
3,41526,Amsterdam,Netherlands,16.9,237.579292
4,244820,London,United Kingdom,64.9,624.592886


### Using apply

`apply` is a very powerful method: it runs a function over every element of a Series, or over every row or column of a DataFrame, which covers most custom data manipulation tasks

In [104]:
countries['CAPITAL'] = countries['capital'].apply(lambda x : x.upper())
countries

Unnamed: 0,area,capital,country,population,newVar,CAPITAL
0,30510,Brussels,Belgium,11.3,197.27112,BRUSSELS
1,671308,Paris,France,64.3,947.933876,PARIS
2,357050,Berlin,Germany,81.3,760.13661,BERLIN
3,41526,Amsterdam,Netherlands,16.9,237.579292,AMSTERDAM
4,244820,London,United Kingdom,64.9,624.592886,LONDON


In [87]:
def ageBucket(x):
    if x<18:
        return "A. <18"
    elif x<25:
        return "B. 18-25"
    elif x<45:
        return "C. 25-45"
    else:
        return "D. >45"
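
One thing to be aware of: a missing age (NaN) fails every `<` comparison, so `ageBucket` returns `"D. >45"` for it. A possible vectorized alternative is `pd.cut`, which leaves NaN as NaN. This is a sketch with made-up ages, not a drop-in replacement for the cells that follow:

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 18, 22, 35, 58, np.nan])

# right=False makes each bin [lower, upper), matching the `<` comparisons in ageBucket
buckets = pd.cut(ages,
                 bins=[-np.inf, 18, 25, 45, np.inf],
                 right=False,
                 labels=["A. <18", "B. 18-25", "C. 25-45", "D. >45"])
```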
        

Apply can be used on a single column (Series object)

In [105]:
df['AgeBucket'] = df['Age'].apply(ageBucket)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBucket
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,B. 18-25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C. 25-45
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,C. 25-45
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C. 25-45
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,C. 25-45


It can also be used on an entire dataframe

In [107]:
df['AgeBucket2'] = df.apply(lambda x : ageBucket(x['Age']),axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBucket,AgeBucket2
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,B. 18-25,B. 18-25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C. 25-45,C. 25-45
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,C. 25-45,C. 25-45
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C. 25-45,C. 25-45
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,C. 25-45,C. 25-45


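Related methods worth looking into: `Series.map` (element-wise on a Series) and `DataFrame.applymap` (element-wise on a whole DataFrame; renamed to `DataFrame.map` in pandas 2.1)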
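
A short sketch of `Series.map`, which accepts either a dictionary or a function. The toy frame and the `ports` lookup table are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Series.map with a dict: values without a key in the dict become NaN
ports = {"S": "Southampton", "C": "Cherbourg"}
toy["Port"] = toy["Embarked"].map(ports)

# Series.map with a function works like Series.apply for element-wise transforms
toy["SexInitial"] = toy["Sex"].map(lambda s: s[0].upper())
```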

## Groupby Operations

### Some 'theory': the groupby operation (split-apply-combine)

The "group by" concept: we want to **apply the same function to subsets of a dataframe, where the subsets are defined by some key**

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure
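
The three steps can be spelled out by hand and compared with the built-in aggregation. A minimal sketch on a toy frame with four invented rows:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                    "Fare": [7.25, 71.2833, 7.925, 8.05]})

# Split: iterating over a groupby yields (key, sub-frame) pairs
# Apply: compute one number per sub-frame
# Combine: collect the results into a Series indexed by key
pieces = {key: group["Fare"].mean() for key, group in toy.groupby("Sex")}
manual = pd.Series(pieces)

# The built-in aggregation does all three steps in one call
builtin = toy.groupby("Sex")["Fare"].mean()
```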

<img src="pandas-tutorial-master/img/splitApplyCombine.png">

Similar to SQL `GROUP BY`

In [25]:
df.groupby('Sex')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f708bfaf748>

In [26]:
df.groupby("Sex").mean()

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
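
The output above silently leaves out the text columns (`Name`, `Ticket`, ...), which is how pandas behaved when this notebook was written. In pandas 2.0 and later the same call raises a `TypeError` when non-numeric columns are present. Two ways around it, sketched on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                    "Name": ["Braund", "Cumings", "Heikkinen", "Allen"],
                    "Age": [22.0, 38.0, 26.0, 35.0],
                    "Fare": [7.25, 71.2833, 7.925, 8.05]})

# Option 1: ask for numeric columns only
means_numeric = toy.groupby("Sex").mean(numeric_only=True)

# Option 2: pick the columns to aggregate explicitly
means_selected = toy.groupby("Sex")[["Age", "Fare"]].mean()
```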


In [115]:
df.groupby('Sex').max()

Sex,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,AgeBucket,AgeBucket2
female,889,1,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",63.0,8,6,WE/P 5735,512.3292,D. >45,D. >45
male,891,1,3,"van Melkebeke, Mr. Philemon",80.0,8,5,WE/P 5735,512.3292,D. >45,D. >45


In [117]:
def getRange(x):
    
    minVal = np.min(x.Fare)
    maxVal = np.max(x.Fare)
    
    return maxVal - minVal


df.groupby('Pclass').apply(lambda x : getRange(x))

Pclass
1    512.3292
2     73.5000
3     69.5500
dtype: float64

Grouping on multiple columns

In [53]:
df.groupby(['Sex','Pclass']).mean()

Sex,Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
female,1,469.212766,0.968085,34.611765,0.553191,0.457447,106.125798
female,2,443.105263,0.921053,28.722973,0.486842,0.605263,21.970121
female,3,399.729167,0.5,21.75,0.895833,0.798611,16.11881
male,1,455.729508,0.368852,41.281386,0.311475,0.278689,67.226127
male,2,447.962963,0.157407,30.740707,0.342593,0.222222,19.741782
male,3,455.51585,0.135447,26.507589,0.498559,0.224784,12.661633


In [130]:
df.groupby(['Sex','Pclass'])['Age'].mean()

Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64

In [17]:
df.groupby('Sex').agg('max')

Sex,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare
female,889,1,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",63.0,8,6,WE/P 5735,512.3292
male,891,1,3,"van Melkebeke, Mr. Philemon",80.0,8,5,WE/P 5735,512.3292


In [24]:
df.groupby('Sex').agg({'PassengerId':'min', 'Age':'max','Fare':'sum'})

Sex,PassengerId,Age,Fare
female,2,63.0,13966.6628
male,1,80.0,14727.2865
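
With a dictionary, the result columns keep the input column names, so it is not obvious from the output which function produced which column. Named aggregation (available since pandas 0.25) lets you choose the output names. A sketch on a toy frame; the names `first_id`, `oldest` and `total_fare` are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                    "PassengerId": [1, 2, 3, 5],
                    "Age": [22.0, 38.0, 26.0, 35.0],
                    "Fare": [7.25, 71.2833, 7.925, 8.05]})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby("Sex").agg(first_id=("PassengerId", "min"),
                                 oldest=("Age", "max"),
                                 total_fare=("Fare", "sum"))
```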


## Merge Operations

Merging in pandas works pretty much like a SQL join. There are four merge types, chosen with the `how` argument:
1. Left
2. Right
3. Inner 
4. Outer

Basic syntax: `pd.merge(left_dataframe, right_dataframe, left_on="some_column", right_on="some_column", how="left|right|inner|outer")`

In [48]:
population = pd.DataFrame({'country': ['Germany', 'Belgium', 'France', 
                        'United Kingdom', 'United States'],'population': [81.3, 11.3, 64.3, 64.9, 65.9]})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [40]:
population

Unnamed: 0,country,population
0,Germany,81.3
1,Belgium,11.3
2,France,64.3
3,United Kingdom,64.9
4,United States,65.9


In [41]:
countries

Unnamed: 0,country,population,area,capital
0,Belgium,11.3,30510,Brussels
1,France,64.3,671308,Paris
2,Germany,81.3,357050,Berlin
3,Netherlands,16.9,41526,Amsterdam
4,United Kingdom,64.9,244820,London


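A Left Merge keeps every row of the LEFT dataframe and adds columns from the RIGHT dataframe wherever the key (here, `country`) matches. Left rows without a match get NaN in the right-hand columns.
Because both dataframes have a `population` column, pandas disambiguates them as `population_x` (left) and `population_y` (right).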

In [42]:
pd.merge(left=population, right=countries, on="country", how="left")

Unnamed: 0,country,population_x,population_y,area,capital
0,Germany,81.3,81.3,357050.0,Berlin
1,Belgium,11.3,11.3,30510.0,Brussels
2,France,64.3,64.3,671308.0,Paris
3,United Kingdom,64.9,64.9,244820.0,London
4,United States,65.9,,,


A Right Merge is the mirror image: it keeps every row of the RIGHT dataframe and adds columns from the LEFT dataframe wherever the key matches.
Right rows without a match get NaN in the left-hand columns.

In [45]:
pd.merge(left=population, right=countries, on="country", how="right")

Unnamed: 0,country,population_x,population_y,area,capital
0,Germany,81.3,81.3,357050,Berlin
1,Belgium,11.3,11.3,30510,Brussels
2,France,64.3,64.3,671308,Paris
3,United Kingdom,64.9,64.9,244820,London
4,Netherlands,,16.9,41526,Amsterdam


An Inner Merge keeps only the keys that appear in both dataframes. If a country isn't in both
dataframes, we don't keep it, so no NaNs are introduced. If no join type is given, inner is the
default.

In [49]:
pd.merge(left=population, right=countries,on ='country')

Unnamed: 0,country,population_x,population_y,area,capital
0,Germany,81.3,81.3,357050,Berlin
1,Belgium,11.3,11.3,30510,Brussels
2,France,64.3,64.3,671308,Paris
3,United Kingdom,64.9,64.9,244820,London


In [50]:
pd.merge(left=population, right=countries,on ='country', how = "inner")

Unnamed: 0,country,population_x,population_y,area,capital
0,Germany,81.3,81.3,357050,Berlin
1,Belgium,11.3,11.3,30510,Brussels
2,France,64.3,64.3,671308,Paris
3,United Kingdom,64.9,64.9,244820,London


An Outer Merge keeps every key from both dataframes and fills the missing side with NaN.

In [51]:
pd.merge(left=population, right=countries,on ='country', how = "outer")

Unnamed: 0,country,population_x,population_y,area,capital
0,Germany,81.3,81.3,357050.0,Berlin
1,Belgium,11.3,11.3,30510.0,Brussels
2,France,64.3,64.3,671308.0,Paris
3,United Kingdom,64.9,64.9,244820.0,London
4,United States,65.9,,,
5,Netherlands,,16.9,41526.0,Amsterdam
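
Two optional `pd.merge` arguments help when reading a result like the one above: `suffixes` replaces the default `_x` / `_y` endings, and `indicator=True` adds a `_merge` column recording where each row came from. A sketch using shortened versions of the two frames (the suffix names are arbitrary):

```python
import pandas as pd

population = pd.DataFrame({"country": ["Germany", "Belgium", "United States"],
                           "population": [81.3, 11.3, 65.9]})
countries = pd.DataFrame({"country": ["Belgium", "Germany", "Netherlands"],
                          "population": [11.3, 81.3, 16.9],
                          "capital": ["Brussels", "Berlin", "Amsterdam"]})

merged = pd.merge(population, countries, on="country", how="outer",
                  suffixes=("_left", "_right"),   # replaces the default _x / _y
                  indicator=True)                 # adds a _merge column

# Map each country to "both", "left_only" or "right_only"
origin = merged.set_index("country")["_merge"].astype(str).to_dict()
```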


## Reading Files

In [79]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv')

In [80]:
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


In [81]:
# header=0 means the first line of the file holds the column names.
# This is also what pandas does by default when no header argument is given.
sales_data2 = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv', header = 0)

In [82]:
sales_data2.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


In [84]:
sales_data3 = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv', header = None)
sales_data3.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,name,birthday,customer,orderdate,product,units,unitprice
1,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
2,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
3,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
4,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01


In [85]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv', usecols=['name', 'birthday'])
sales_data.head(5)

Unnamed: 0,name,birthday
0,Pasquale,1967-09-02
1,India,1968-12-13
2,Wayne,1992-09-10
3,Cori,1986-11-05
4,Chang,1972-04-23


In [91]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv', header= None, skiprows=2)
sales_data.columns = ['name', 'birthday', 'customer', 'orderdate', 'product', 'units', 'unitprice']
sales_data.head(2)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
1,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
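
The cell above drops both the header row and the first data row (`skiprows=2` with `header=None`) and then names the columns by hand. If the goal is only to rename columns or to skip a particular row, the following sketch shows two alternatives. It reads from an in-memory string instead of the CSV file, and the replacement column names are invented:

```python
from io import StringIO
import pandas as pd

# Stand-in for the CSV file: a header row followed by two data rows
csv_text = """name,birthday,units
Pasquale,1967-09-02,2
India,1968-12-13,26
"""

# header=0 marks the first row as the header; names= then replaces it
renamed = pd.read_csv(StringIO(csv_text), header=0,
                      names=["customer_name", "birth_date", "quantity"])

# skiprows=[1] skips only the first data row and keeps the header
without_first = pd.read_csv(StringIO(csv_text), skiprows=[1])
```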


In [93]:
# For ambiguous dates such as 06/07/2016 the parser assumes month first (US style, MM/DD/YYYY)


sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv',parse_dates=['birthday', 'orderdate'])
sales_data.head(2)                     

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69


In [3]:
# To have ambiguous dates read day first (DD/MM/YYYY, the more common international order), add dayfirst=True
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv',parse_dates=['birthday', 'orderdate'], dayfirst=True)
sales_data.head(2) 

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
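
The dates in this file are written year first, so `dayfirst` makes no visible difference to the output above. Its effect shows up on an ambiguous string. A small sketch with `pd.to_datetime`, which takes the same `dayfirst` argument:

```python
import pandas as pd

ambiguous = "06/07/2016"

month_first = pd.to_datetime(ambiguous)                 # read as 7 June 2016
day_first = pd.to_datetime(ambiguous, dayfirst=True)    # read as 6 July 2016

# An explicit format removes the guesswork entirely
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")
```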


In [4]:
sales_data.dtypes

name                 object
birthday     datetime64[ns]
customer             object
orderdate    datetime64[ns]
product              object
units                 int64
unitprice           float64
dtype: object

In [5]:
sales_data['modified_orderdate'] = sales_data['orderdate'].apply(lambda x: "%d/%d/%d" % (x.day, x.month, x.year))
sales_data.head(4)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice,modified_orderdate
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27,17/7/2016
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69,6/7/2016
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79,22/7/2016
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01,23/7/2016


In [6]:
sales_data.dtypes

name                          object
birthday              datetime64[ns]
customer                      object
orderdate             datetime64[ns]
product                       object
units                          int64
unitprice                    float64
modified_orderdate            object
dtype: object

In [7]:
sales_data['Hour'] = sales_data['orderdate'].apply(lambda x: "%d" % (x.hour))
sales_data.head(4)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice,modified_orderdate,Hour
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27,17/7/2016,13
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69,6/7/2016,13
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79,22/7/2016,13
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01,23/7/2016,13


In [8]:
# The strings are day/month/year, so state the format explicitly; otherwise 6/7/2016 is read as June 7
sales_data["modified_orderdate"] = pd.to_datetime(sales_data["modified_orderdate"], format="%d/%m/%Y")
sales_data.head(4)
sales_data.dtypes

name                          object
birthday              datetime64[ns]
customer                      object
orderdate             datetime64[ns]
product                       object
units                          int64
unitprice                    float64
modified_orderdate    datetime64[ns]
Hour                          object
dtype: object

In [9]:
sales_data['birth_month'] = sales_data['birthday'].dt.month
sales_data.head(4)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice,modified_orderdate,Hour,birth_month
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27,2016-07-17,13,9
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69,2016-07-06,13,12
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79,2016-07-22,13,9
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01,2016-07-23,13,11
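
The `.dt` accessor used for `birth_month` covers the earlier `apply` calls as well. A sketch on a two-row toy frame; the column names `order_day`, `hour` and `weekday` are made up:

```python
import pandas as pd

orders = pd.DataFrame({"orderdate": pd.to_datetime(["2016-07-17 13:48:03",
                                                    "2016-07-06 13:48:03"])})

orders["order_day"] = orders["orderdate"].dt.strftime("%d/%m/%Y")  # string column
orders["hour"] = orders["orderdate"].dt.hour                       # integer column
orders["weekday"] = orders["orderdate"].dt.day_name()              # e.g. "Sunday"
```

Unlike the `"%d/%d/%d"` string formatting, `strftime("%d/%m/%Y")` zero-pads day and month, and `.dt.hour` returns integers rather than strings.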


In [10]:
sales_data_json = pd.read_json('pandas-tutorial-master/data/blooth_sales_data.json')
sales_data_json.head(5)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1974-01-07,Frontier Industries,Ernesto,2016-10-06 08:21:20.544568,Star Wars,11.81,27
1,1986-02-05,Bell Telecom Limited,Queen,2016-09-30 08:21:20.544599,PlayStation,284.71,1
2,1982-07-06,Software Co,Reid,2016-10-05 08:21:20.544622,banana,10.0,49
3,1971-04-12,Data Design Galaxy Co,Arlene,2016-10-02 08:21:20.544643,Thriller record,16.77,48
4,1984-12-14,Frontier Inc,Nikita,2016-10-16 08:21:20.544666,Harry Potter book,5.65,4


## Missing Data
How should missing data (NaNs) be handled? The two most commonly used methods are `fillna` and `dropna`.

In [126]:
missing_df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
missing_df['four'] = 'bar'
missing_df['five'] = missing_df['one'] > 0
missing_df.loc[['a','c','h'],['one','four']] = np.nan
missing_df

Unnamed: 0,one,two,three,four,five
a,,-0.522296,0.607862,,True
c,,-0.104641,0.014973,,False
e,2.275265,1.220143,0.324177,bar,True
f,-0.348951,-1.410194,-0.680598,bar,False
h,,1.850061,0.003998,,False


In [127]:
# fillna replaces NA/NaN values with the given value.
missing_df.fillna(0)

Unnamed: 0,one,two,three,four,five
a,0.0,-0.522296,0.607862,0,True
c,0.0,-0.104641,0.014973,0,False
e,2.275265,1.220143,0.324177,bar,True
f,-0.348951,-1.410194,-0.680598,bar,False
h,0.0,1.850061,0.003998,0,False


In [128]:
missing_df['one'].fillna('missing')

a     missing
c     missing
e     2.27526
f   -0.348951
h     missing
Name: one, dtype: object
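
Filling every column with the same value mixes types, as the `0` in the text column `four` and the `object` dtype above show. `fillna` also accepts a dictionary of per-column fill values. A minimal sketch on a toy frame; the fill values are arbitrary choices:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"one": [np.nan, 2.0, 4.0],
                    "four": [np.nan, "bar", "bar"]})

# A dict maps column name -> fill value, so each column keeps a sensible type
filled = toy.fillna({"one": toy["one"].mean(), "four": "unknown"})
```

`fillna` returns a new frame by default, so `toy` itself still contains the NaNs.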

`dropna` drops the rows or columns that contain NA/NaN values.
<br>
The `axis` argument determines whether rows or columns containing missing values are removed.
<br>
`axis=0`: drop rows that contain missing values (default).
<br>
`axis=1`: drop columns that contain missing values.
<br>


The `how` argument determines whether a row or column is removed when it has at least one NA, or only when all of its values are NA.
<br>
`how='any'`: drop the row or column if any NA values are present (default).
<br>
`how='all'`: drop the row or column only if all of its values are NA.
<br>

In [129]:
missing_df.dropna(axis=0)

Unnamed: 0,one,two,three,four,five
e,2.275265,1.220143,0.324177,bar,True
f,-0.348951,-1.410194,-0.680598,bar,False


In [130]:
missing_df.dropna(axis=1)

Unnamed: 0,two,three,five
a,-0.522296,0.607862,True
c,-0.104641,0.014973,False
e,1.220143,0.324177,True
f,-1.410194,-0.680598,False
h,1.850061,0.003998,False


In [132]:
missing_df['six'] = np.nan
missing_df

Unnamed: 0,one,two,three,four,five,six
a,,-0.522296,0.607862,,True,
c,,-0.104641,0.014973,,False,
e,2.275265,1.220143,0.324177,bar,True,
f,-0.348951,-1.410194,-0.680598,bar,False,
h,,1.850061,0.003998,,False,


In [133]:
missing_df.dropna(axis=1, how = 'all')

Unnamed: 0,one,two,three,four,five
a,,-0.522296,0.607862,,True
c,,-0.104641,0.014973,,False
e,2.275265,1.220143,0.324177,bar,True
f,-0.348951,-1.410194,-0.680598,bar,False
h,,1.850061,0.003998,,False


In [135]:
#dropping rows only where some columns are missing
missing_df.dropna(subset = ['one', 'two', 'four'])

Unnamed: 0,one,two,three,four,five,six
e,2.275265,1.220143,0.324177,bar,True,
f,-0.348951,-1.410194,-0.680598,bar,False,
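
Before dropping anything, it can help to count how many values are missing in each column. The sketch below does that with `isna().sum()` and also shows the `thresh` argument of `dropna`, which keeps rows with at least a given number of non-missing values. The frame `demo` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column "six" is entirely missing
demo = pd.DataFrame({
    "one": [np.nan, 1.0, 2.0, np.nan],
    "two": [0.5, 0.1, np.nan, 0.3],
    "six": [np.nan, np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Keep only the rows that have at least 2 non-missing values
kept = demo.dropna(thresh=2)
print(kept)
```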


# Exercises

## Titanic

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Calculate the number of passengers with Pclass = 3

In [5]:
len(df[df.Pclass==3])

491

Compute the proportion of passengers that survived

In [7]:
len(df[df.Survived==1])/len(df)

0.3838383838383838

How many children (passengers below the age of 18) are there?

In [8]:
len(df[df.Age<18])

113

What's the ratio of male to female passengers?

In [48]:
df.groupby(np.ones(len(df))).apply(lambda x : np.sum(x.Sex=='male')/np.sum(x.Sex=='female') )

1.0    1.83758
dtype: float64

In [46]:
len(df[df.Sex=='male'])/len(df[df.Sex=='female'])

1.8375796178343948
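
The same ratio can also be read off a single `value_counts()` call, which counts every category in one pass. A minimal sketch on a made-up Series `sex` (not the Titanic column):

```python
import pandas as pd

# Hypothetical data: 3 males and 2 females
sex = pd.Series(["male", "female", "male", "male", "female"])

# One count per category, indexed by the category label
counts = sex.value_counts()
ratio = counts["male"] / counts["female"]
print(ratio)  # 1.5
```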

For each gender, what proportion of passengers survived?

In [37]:
df.groupby('Sex').apply(lambda x : np.mean(x.Survived) )

Sex
female    0.742038
male      0.188908
dtype: float64

In [32]:
df.groupby('Sex').agg({"Survived":'mean'})

Unnamed: 0_level_0,Survived
Sex,Unnamed: 1_level_1
female,0.742038
male,0.188908


Create a new variable which has 0 for male and 1 for female. Name this variable **LabelEncode_Sex**

In [31]:
df['LabelEncode_Sex'] = df['Sex'].apply(lambda x : 1 if x=='female' else 0)

Create a variable that takes the value of 1 when Pclass is 1 and 0 otherwise. Create similar variables for when Pclass has a value of 2 and 3.

Name these variables **OHE_Pclass1, OHE_Pclass2, OHE_Pclass3** respectively

In [None]:
df['OHE_Pclass1'] = df['Pclass'].apply(lambda x : 1 if x ==1 else 0)
df['OHE_Pclass2'] = df['Pclass'].apply(lambda x : 1 if x ==2 else 0)
df['OHE_Pclass3'] = df['Pclass'].apply(lambda x : 1 if x ==3 else 0)
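
As an alternative sketch, `pd.get_dummies` can build all the indicator columns in one call. The frame `demo` is made up for illustration. `prefix_sep=""` is used so that the column names match the ones requested above, and `astype(int)` is applied because some pandas versions return booleans rather than 0/1.

```python
import pandas as pd

# Hypothetical frame with a Pclass column
demo = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# One indicator column per distinct Pclass value
dummies = pd.get_dummies(demo["Pclass"], prefix="OHE_Pclass", prefix_sep="").astype(int)

# Attach the indicator columns to the original frame
demo = pd.concat([demo, dummies], axis=1)
print(demo)
```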


Calculate the mean fare for all samples with an odd index

In [None]:
df.loc[df.index % 2 == 1, 'Fare'].mean()
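
The filter above selects rows by index label. With the default integer index, selecting every second row by position gives the same result. A small sketch on a made-up frame `demo`, comparing the two:

```python
import pandas as pd

# Hypothetical fares with the default 0..4 index
demo = pd.DataFrame({"Fare": [7.25, 71.28, 7.92, 53.10, 8.05]})

# Label-based: rows whose index label is odd
by_label = demo.loc[demo.index % 2 == 1, "Fare"].mean()

# Position-based: every second row, starting at position 1
by_position = demo["Fare"].iloc[1::2].mean()

print(by_label, by_position)
```

The two only agree when the index is the default 0, 1, 2, ... range; after filtering or sorting, labels and positions can differ.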

Create a new variable which stores the last name of passengers

In [19]:
df['familyName'] = df['Name'].apply(lambda x : x.split(",")[0])
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,familyName
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen


Calculate the number of unique families (based on last names)

In [22]:
df.familyName.nunique()

667

Create a variable that indicates the **size of the family** for each passenger. *Family size is the number of passengers with the same family name*

In [2]:
# Count passengers per family name; this gives one row per family
familySize = df.groupby('familyName')['PassengerId'].count().reset_index()

In [26]:
# Attach the per-family count to every passenger row
df['familySize'] = df.groupby('familyName')['PassengerId'].transform('count')

familySize.head(10)

Unnamed: 0,familyName,PassengerId
0,Abbing,1
1,Abbott,2
2,Abelson,2
3,Adahl,1
4,Adams,1
5,Ahlin,1
6,Aks,1
7,Albimona,1
8,Alexander,1
9,Alhomaki,1
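
A per-family count can be attached to every passenger row without rescanning the frame for each passenger. A minimal sketch on a made-up frame `demo`, comparing `groupby(...).transform('count')` with a `map` over `value_counts()`:

```python
import pandas as pd

# Hypothetical family names; "Abbott" appears twice
demo = pd.DataFrame({"familyName": ["Abbott", "Adams", "Abbott", "Aks"]})

# Option 1: broadcast the group count back onto every row
demo["familySize"] = demo.groupby("familyName")["familyName"].transform("count")

# Option 2: look up each name in the value_counts table
demo["familySize_map"] = demo["familyName"].map(demo["familyName"].value_counts())

print(demo)
```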


#### Fare by Cabin Index

All cabin numbers begin with a letter. We hypothesize that this first letter actually has a significance. So create a new variable that stores the first letter of the cabin variable. Call this **CabinIndex**.

NOTE : The cabin variable has missing values. Also check for the data type of the Cabin variable.

Once you have created the CabinIndex variable, calculate the mean value of fare for different levels of CabinIndex

In [68]:
df['CabinIndex'] = df['Cabin'].apply(lambda x : str(x)[0] if pd.notnull(x) else np.nan)

In [69]:
df.groupby('CabinIndex')['Fare'].mean()

CabinIndex
A     39.623887
B    113.505764
C    100.151341
D     57.244576
E     46.026694
F     18.696792
G     13.581250
T     35.500000
Name: Fare, dtype: float64
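
The `.str` accessor offers a shorter route to the first letter, and it leaves missing values as NaN without an explicit null check. A minimal sketch on made-up data (`demo` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical cabins and fares; the second cabin is missing
demo = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46", "C123"],
    "Fare": [71.28, 7.25, 56.50, 53.10],
})

# First character of each cabin; missing cabins stay missing
demo["CabinIndex"] = demo["Cabin"].str[0]

# groupby drops rows whose key is NaN by default
mean_fare = demo.groupby("CabinIndex")["Fare"].mean()
print(mean_fare)
```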

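Without the `pd.notnull` check, `str(x)[0]` turns a missing cabin into the letter `n` (the first character of `'nan'`). The output below appears to come from such an earlier run:

In [54]:
df.loc[df.CabinIndex=='n']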

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,CabinIndex
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,n
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,n
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,n
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,n
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S,n
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,n
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,n
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S,n
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S,n
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S,n


## Sales Data

For sales_data, create a variable named mean_units which is the average of all units for customers whose birth month lies between February and August (inclusive).

In [None]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv')
sales_data["birthday"] = pd.to_datetime(sales_data["birthday"])
# Months are numbered 1-12, so February through August is 2..8 inclusive
sales_data_mid = sales_data[sales_data['birthday'].dt.month.between(2, 8)]
mean_units = sales_data_mid["units"].mean()
mean_units
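
Whether the boundary months are included changes the answer, so it is worth checking the mask on a few rows. A small sketch with made-up birthdays and units, using `Series.between`, which includes both endpoints by default:

```python
import pandas as pd

# Hypothetical birthdays in January, February, July, August and December
birthdays = pd.to_datetime(pd.Series(
    ["1974-01-07", "1986-02-05", "1982-07-06", "1971-08-12", "1984-12-14"]
))
units = pd.Series([27, 1, 49, 48, 4])

# True for months 2 through 8, endpoints included
in_range = birthdays.dt.month.between(2, 8)
mean_units_demo = units[in_range].mean()

print(in_range.tolist())
print(mean_units_demo)
```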

Create a new column in sales_data titled 'order_minutes' and for each row, store the minutes from orderdate 

In [None]:
sales_data["orderdate"]= pd.to_datetime(sales_data["orderdate"])
sales_data['order_minutes'] = sales_data['orderdate'].dt.minute
sales_data.head(2)

For sales_data dataframe, create a dataframe called 'sd_df' to store only those rows where product is 'Harry Potter book'

In [None]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv')
sd_df = sales_data[sales_data["product"] == 'Harry Potter book' ]
sd_df.head(2)

For sales_data, find the data of people who were born before 1980

In [None]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv')
sales_data["birthday"]= pd.to_datetime(sales_data["birthday"])
sales_data= sales_data[sales_data['birthday'].dt.year < 1980]
sales_data.head(2)

For sales_data, find the average unitprice for products that were ordered in the first week of a month

In [None]:
sales_data = pd.read_csv('pandas-tutorial-master/data/blooth_sales_data.csv')
sales_data["orderdate"] = pd.to_datetime(sales_data["orderdate"])
# First week of a month = days 1 through 7
first_week = sales_data[sales_data['orderdate'].dt.day <= 7]
first_week['unitprice'].mean()

Create a column named 'count_units' in the sales_data dataframe to store the number of units sold for each product

In [None]:
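# transform returns one value per row, so the per-product total can be stored as a column
sales_data['count_units'] = sales_data.groupby('product')['units'].transform('sum')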
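
A grouped `sum()` returns one row per product, whereas `transform('sum')` returns one value per original row and can therefore be assigned as a new column. A minimal sketch on a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical sales; "banana" appears in two orders
demo = pd.DataFrame({
    "product": ["banana", "Star Wars", "banana"],
    "units": [49, 27, 4],
})

# One row per product
totals = demo.groupby("product")["units"].sum()

# One value per original row, aligned with demo's index
demo["count_units"] = demo.groupby("product")["units"].transform("sum")

print(totals)
print(demo)
```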

Create a new column in sales_data and store orderdate in the format mm/dd/yyyy

In [None]:
# %m and %d are zero-padded, which matches the mm/dd/yyyy format
sales_data['new_date'] = sales_data['orderdate'].dt.strftime('%m/%d/%Y')
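
The mm/dd/yyyy format implies two-digit months and days. The sketch below, on two made-up timestamps, contrasts `dt.strftime`, which zero-pads, with `%d`-style integer formatting, which does not:

```python
import pandas as pd

# Hypothetical order dates
orderdate = pd.to_datetime(pd.Series(["2016-10-06 08:21:20", "2016-09-30 08:21:20"]))

# Zero-padded month and day
padded = orderdate.dt.strftime("%m/%d/%Y")

# Plain integer formatting drops the leading zeros
unpadded = orderdate.apply(lambda x: "%d/%d/%d" % (x.month, x.day, x.year))

print(padded.tolist())    # ['10/06/2016', '09/30/2016']
print(unpadded.tolist())  # ['10/6/2016', '9/30/2016']
```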

## Iris Dataset

In [49]:
## Loading the dataset

from sklearn.datasets import load_iris
data = load_iris()

In [50]:
data.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [51]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [52]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [16]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

### Exercises

Put together all the components of the data variable into a Pandas DataFrame. *This means putting together the feature and target variables, and adding their names as column names*

In [53]:
df = pd.DataFrame(data.data, columns = data.feature_names)

df['target'] = data.target

df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
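
The integer codes in `target` can also be translated into class names by indexing the names array with the codes. A minimal sketch using small made-up arrays in place of `data.target` and `data.target_names`; the column name `species` is an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

# Stand-ins for data.target and data.target_names
target = np.array([0, 0, 1, 2])
target_names = np.array(["setosa", "versicolor", "virginica"])

demo = pd.DataFrame({"target": target})

# Integer-array indexing: code i picks target_names[i]
demo["species"] = target_names[demo["target"].to_numpy()]
print(demo)
```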


Find the number of observations in the dataset which belong to class setosa and have a petal length > 3

In [55]:
len(df[ (df.target==0) & (df['petal length (cm)'] >3 )])

0

Find the maximum and minimum values of each of the features.

In [56]:
df.agg(['min','max'])

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
min,4.3,2.0,1.0,0.1,0
max,7.9,4.4,6.9,2.5,2


For each target class, find the range of values for all the features.

In [60]:
def getRange(x):
    return np.max(x) - np.min(x)

# Note: this computes the range over the whole dataset; a per-class version needs a groupby on 'target'
df.apply(getRange, axis=0)

sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
target               2.0
dtype: float64
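
The output above is the range over the whole dataset. To get the range within each target class, the same max-minus-min computation can be applied after a `groupby`. A minimal sketch on a made-up frame `demo` with two features and two classes (not the real iris values):

```python
import pandas as pd

# Hypothetical measurements for two classes
demo = pd.DataFrame({
    "petal length (cm)": [1.4, 1.9, 4.0, 5.1],
    "petal width (cm)": [0.2, 0.4, 1.0, 1.8],
    "target": [0, 0, 1, 1],
})

# For each class, compute max - min of every feature column
ranges = demo.groupby("target").agg(lambda col: col.max() - col.min())
print(ranges)
```

The result has one row per class and one column per feature.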

For each of the target classes, find the mean value of each of the independent variables. The mean values should be represented in a table.

**Do not** use for loops. This should be doable in a single line of code

In [68]:
df.groupby('target').mean()

Unnamed: 0_level_0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,5.006,3.428,1.462,0.246
1,5.936,2.77,4.26,1.326
2,6.588,2.974,5.552,2.026


In [70]:
# Alternate solution using apply within apply
def getMeans(x):
    # Mean of the four feature columns within one target group
    meanValues = x[x.columns[:4]].apply(np.mean, axis=0)
    return meanValues

df.groupby('target').apply(getMeans)

Unnamed: 0_level_0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,5.006,3.428,1.462,0.246
1,5.936,2.77,4.26,1.326
2,6.588,2.974,5.552,2.026


In [71]:
# Another possible solution. A solution using apply is generally preferred over one using for loops.
# Note that this version returns a list per class rather than a table.
def getMeans1(x):
    meanValues = []
    for col in x.columns[:4]:
        val = np.round(np.mean(x[col]), 3)
        meanValues.append(val)
    return meanValues

df.groupby('target').apply(getMeans1)

target
0    [5.006, 3.428, 1.462, 0.246]
1      [5.936, 2.77, 4.26, 1.326]
2    [6.588, 2.974, 5.552, 2.026]
dtype: object