# Lesson 4 Practice: Pandas Part 2

Use this notebook to follow along with the lesson in the corresponding lesson notebook: [L04-Pandas_Part2-Lesson.ipynb](./L04-Pandas_Part2-Lesson.ipynb).  


## Instructions
Follow along with the teaching material in the lesson. Sections labeled "Tasks" are interspersed throughout the tutorial and marked with the icon: ![Task](http://icons.iconarchive.com/icons/sbstnblnd/plateau/16/Apps-gnome-info-icon.png). Follow the instructions in these sections by performing them in the practice notebook. When the tutorial is complete, you can turn in the final practice notebook. For each task, use the cell below it to write and test your code. You may add cells for any task as needed.

## Task 1a: Setup

- import pandas
- re-create the `df` data frame
- re-create the `iris_df` data frame

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame(
{'alpha': [0, 1, 2, 3, 4],
 'beta': ['a', 'b', 'c', 'd', 'e']})
df

,alpha,beta
0,0,a
1,1,b
2,2,c
3,3,d
4,4,e


In [4]:
iris_df = pd.read_csv('data/iris.csv')
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## Task 2a: Inserting Columns

+ Create a copy of the `df` dataframe.
+ Add a new column named "delta" to the copy that consists of random numbers.

In [7]:
df['delta'] = np.random.random([5])
df

,alpha,beta,delta
0,0,a,0.951307
1,1,b,0.723017
2,2,c,0.755553
3,3,d,0.204874
4,4,e,0.179613
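
The task asks for the new column to go on a copy rather than on `df` itself. A minimal sketch of that variant, assuming a copy named `df_copy` (a hypothetical name) and NumPy's `default_rng` generator with an arbitrary seed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']})

# copy() returns an independent frame, so the original df is left untouched
df_copy = df.copy()

# one random float in [0, 1) per row; the seed only makes the run repeatable
rng = np.random.default_rng(0)
df_copy['delta'] = rng.random(len(df_copy))

print(df_copy)
print(df.columns.tolist())  # still ['alpha', 'beta']
```

Working on a copy keeps `df` in its original two-column form for the later tasks.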


## Task 3a: Missing Data

+ Create two new copies of the `df` dataframe.
+ Add a new column to both that has missing values.
+ In one copy, replace missing values with a value of your choice.
+ In the other copy, drop rows with `NaN` values.
+ Print both data frames to confirm.

In [8]:
df['gamma'] = pd.Series([2,5,7, np.nan, 8])
df

,alpha,beta,delta,gamma
0,0,a,0.951307,2.0
1,1,b,0.723017,5.0
2,2,c,0.755553,7.0
3,3,d,0.204874,
4,4,e,0.179613,8.0


In [14]:
a = df.fillna(100)
a

,alpha,beta,delta,gamma
0,0,a,0.951307,2.0
1,1,b,0.723017,5.0
2,2,c,0.755553,7.0
3,3,d,0.204874,100.0
4,4,e,0.179613,8.0


In [15]:
df['theta'] = pd.Series([1,6,9, np.nan, 8])
df

,alpha,beta,delta,gamma,theta
0,0,a,0.951307,2.0,1.0
1,1,b,0.723017,5.0,6.0
2,2,c,0.755553,7.0,9.0
3,3,d,0.204874,,
4,4,e,0.179613,8.0,8.0


In [17]:
b = df.dropna()
b

,alpha,beta,delta,gamma,theta
0,0,a,0.951307,2.0,1.0
1,1,b,0.723017,5.0,6.0
2,2,c,0.755553,7.0,9.0
4,4,e,0.179613,8.0,8.0
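
The task describes two separate copies, one filled and one with rows dropped. A hedged sketch of that layout, where the names `filled` and `dropped` and the fill value `100` are illustrative choices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']})

# two independent copies, each given the same column with one missing value
filled = df.copy()
dropped = df.copy()
for frame in (filled, dropped):
    frame['gamma'] = [2, 5, 7, np.nan, 8]

# fillna returns a new frame with every NaN replaced by the given value
filled = filled.fillna(100)

# dropna returns a new frame without any row that contains a NaN
dropped = dropped.dropna()

print(filled)
print(dropped)
```

Both `fillna` and `dropna` return new objects by default, so the results have to be assigned back (or to a new name) to be kept.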


## Task 4a: Operations
<span style="float:right; margin-left:10px; clear:both;">![Task](./media/task-icon.png)</span>

View the [Computational tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html) and [statistical methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#method-summary) documentation.
Using the list of operational functions, choose five functions to use with the iris data frame.



In [18]:
iris_df.mean(numeric_only=True)  # skip the non-numeric species column

sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64

In [23]:
iris_df.mean(axis=1, numeric_only=True)  # mean across the numeric columns of each row

0      2.550
1      2.375
2      2.350
3      2.350
4      2.550
       ...  
145    4.300
146    3.925
147    4.175
148    4.325
149    3.950
Length: 150, dtype: float64

In [25]:
iris_df.min(axis=1, numeric_only=True)  # smallest numeric value in each row

0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
      ... 
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Length: 150, dtype: float64

In [26]:
iris_df.std(axis=1, numeric_only=True)  # standard deviation within each row

0      2.179449
1      2.036950
2      1.997498
3      1.912241
4      2.156386
         ...   
145    2.021551
146    2.075853
147    2.046745
148    1.791415
149    1.884144
Length: 150, dtype: float64

In [27]:
iris_df.var(axis=1, numeric_only=True)  # variance within each row

0      4.750000
1      4.149167
2      3.990000
3      3.656667
4      4.650000
         ...   
145    4.086667
146    4.309167
147    4.189167
148    3.209167
149    3.550000
Length: 150, dtype: float64

In [30]:
iris_df.count(1)

0      5
1      5
2      5
3      5
4      5
      ..
145    5
146    5
147    5
148    5
149    5
Length: 150, dtype: int64
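
The cells above mix column-wise and row-wise reductions. As a small illustration of how the `axis` argument changes the result, here is a sketch on a three-row stand-in for `iris_df` (the frame `sample` is hypothetical and only mirrors the iris column names):

```python
import pandas as pd

# Hypothetical three-row stand-in for iris_df
sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width': [3.5, 3.2, 3.3],
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
    'species': ['setosa', 'versicolor', 'virginica']})

# keep only the numeric columns so string data never enters the statistics
numeric = sample.select_dtypes(include='number')

col_means = numeric.mean()        # axis=0 (default): one value per column
row_means = numeric.mean(axis=1)  # axis=1: one value per row

print(col_means)
print(row_means)
```

Selecting the numeric columns first is one way to avoid errors from the `species` column; passing `numeric_only=True` to the reduction is another.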

## Task 4b:  Apply

Practice using `apply` on either the `df` or `iris_df` data frames using any two functions of your choice other than `print`, `type`, and `np.sum`.

In [31]:
help(df.apply)

Help on method apply in module pandas.core.frame:

apply(func, axis=0, raw=False, result_type=None, args=(), **kwds) method of pandas.core.frame.DataFrame instance
    Apply a function along an axis of the DataFrame.
    
    Objects passed to the function are Series objects whose index is
    either the DataFrame's index (``axis=0``) or the DataFrame's columns
    (``axis=1``). By default (``result_type=None``), the final return type
    is inferred from the return type of the applied function. Otherwise,
    it depends on the `result_type` argument.
    
    Parameters
    ----------
    func : function
        Function to apply to each column or row.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis along which the function is applied:
    
        * 0 or 'index': apply function to each column.
        * 1 or 'columns': apply function to each row.
    
    raw : bool, default False
        Determines if row or column is passed as a Series or ndarray object:
    
    

In [33]:
df.apply(np.sum)

alpha         10
beta       abcde
delta    2.81436
gamma         22
theta         24
dtype: object

In [35]:
iris_df.apply(np.sum)

sepal_length                                                876.5
sepal_width                                                 458.1
petal_length                                                563.8
petal_width                                                 179.8
species         setosasetosasetosasetosasetosasetosasetosaseto...
dtype: object
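
The task asks for functions other than `print`, `type`, and `np.sum`. A minimal sketch with two alternatives, assuming the original two-column `df`; the choice of `pd.Series.nunique` and the row-wise lambda is illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']})

# axis=0 (default): the function receives each column as a Series
unique_counts = df.apply(pd.Series.nunique)

# axis=1: the function receives each row as a Series indexed by column name;
# here the letter in beta is repeated alpha + 1 times
labels = df.apply(lambda row: row['beta'] * (int(row['alpha']) + 1), axis=1)

print(unique_counts)
print(labels)
```

The first call returns one value per column, the second one value per row, which matches the `axis` description in the help text above.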

## Task 4c: Occurrences
Identify the number of occurrences for each species (virginica, versicolor, setosa) in the `iris_df` object.  *Hint*: the `value_counts` function only works on a `pd.Series` object, not on the full data frame.

In [37]:
iris_df['species'].value_counts()  # Series method; the top-level pd.value_counts is deprecated in recent pandas

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

## Task 5a: String Methods

+ Create a list of five strings that represent dates in the form YYYY-MM-DD (e.g. 2020-02-20 for Feb 20th, 2020).
+ Add this list of dates as a new column in the `df` dataframe.
+ Now split the date into 3 new columns with one column representing the year, another the month and another the day.
+ Combine the values from columns `alpha` and `beta` into a new column where the values are separated with a colon.


In [39]:
dates = pd.date_range('20210301', periods=5)  # named `dates` to avoid shadowing the built-in `list`
dates

DatetimeIndex(['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-04',
               '2021-03-05'],
              dtype='datetime64[ns]', freq='D')

In [40]:
df['date'] = dates
df

,alpha,beta,date
0,0,a,2021-03-01
1,1,b,2021-03-02
2,2,c,2021-03-03
3,3,d,2021-03-04
4,4,e,2021-03-05


In [70]:
df[['year', 'month', 'day']] = df['date'].astype(str).str.split('-',expand=True)
df

,alpha,beta,date,year,month,day
0,0,a,2021-03-01,2021,3,1
1,1,b,2021-03-02,2021,3,2
2,2,c,2021-03-03,2021,3,3
3,3,d,2021-03-04,2021,3,4
4,4,e,2021-03-05,2021,3,5


In [77]:
df['combined'] = df['alpha'].astype(str) + ':' + df['beta']
df

,alpha,beta,date,year,month,day,combined
0,0,a,2021-03-01,2021,3,1,0:a
1,1,b,2021-03-02,2021,3,2,1:b
2,2,c,2021-03-03,2021,3,3,2:c
3,3,d,2021-03-04,2021,3,4,3:d
4,4,e,2021-03-05,2021,3,5,4:e
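
The task statement starts from a plain list of date strings rather than a `DatetimeIndex`. A sketch of that variant using only string methods, assuming the original two-column `df`; the name `date_strings` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']})

# five dates written as YYYY-MM-DD strings
date_strings = ['2021-03-01', '2021-03-02', '2021-03-03',
                '2021-03-04', '2021-03-05']
df['date'] = date_strings

# str.split with expand=True returns one column per piece of the split
df[['year', 'month', 'day']] = df['date'].str.split('-', expand=True)

# str.cat joins two string Series element by element with a separator
df['combined'] = df['alpha'].astype(str).str.cat(df['beta'], sep=':')

print(df)
```

Because the pieces come from string splitting, `year`, `month`, and `day` stay strings (for example `'03'`); convert with `astype(int)` if numbers are needed.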


## Task 6a: Concatenation by Rows
+ Create the following dataframe
```Python
df1 = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']}, index = ['I1', 'I2' ,'I3', 'I4', 'I5'])
```
+ Create a new dataframe named `df2` with column names "delta" and "gamma" that contains 5 rows with some index names that overlap with the `df1` dataframe and some that do not.
+ Concatenate the two dataframes by rows and print the result.
+ You should see the two have combined one after the other, but there should also be missing values added. 
+ Explain why there are missing values.


In [75]:
df1 = pd.DataFrame(
  {'alpha': [0, 1, 2, 3, 4],
   'beta': ['a', 'b', 'c', 'd', 'e']}, index = ['I1', 'I2' ,'I3', 'I4', 'I5'])
df1

,alpha,beta
I1,0,a
I2,1,b
I3,2,c
I4,3,d
I5,4,e


In [76]:
df2 = pd.DataFrame(
  {'delta': [0, 3, 5, 7, 9],
   'gamma': ['e', 'i', 'a', 's', 'p']}, index = ['I4', 'I5', 'I6', 'I7' ,'I8',])
df2

,delta,gamma
I4,0,e
I5,3,i
I6,5,a
I7,7,s
I8,9,p


In [80]:
df3 = pd.concat([df1, df2], axis = 0)
df3  # missing values appear because df1 and df2 share no column labels, so each row lacks the other frame's columns

,alpha,beta,delta,gamma
I1,0.0,a,,
I2,1.0,b,,
I3,2.0,c,,
I4,3.0,d,,
I5,4.0,e,,
I4,,,0.0,e
I5,,,3.0,i
I6,,,5.0,a
I7,,,7.0,s
I8,,,9.0,p
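
To make the reason for the missing values concrete, the sketch below counts the `NaN` entries per column and then repeats the row-wise concatenation with `join='inner'`. The names `stacked` and `shared` are hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e']},
    index=['I1', 'I2', 'I3', 'I4', 'I5'])
df2 = pd.DataFrame(
    {'delta': [0, 3, 5, 7, 9],
     'gamma': ['e', 'i', 'a', 's', 'p']},
    index=['I4', 'I5', 'I6', 'I7', 'I8'])

# default join='outer' keeps the union of the column labels
stacked = pd.concat([df1, df2], axis=0)

# each column is missing for exactly the five rows of the other frame
print(stacked.isna().sum())

# join='inner' keeps only column labels present in both frames (none here)
shared = pd.concat([df1, df2], axis=0, join='inner')
print(shared.columns.tolist())
```

When concatenating by rows, alignment happens on column labels, so overlapping index labels such as `I4` and `I5` are simply repeated rather than merged.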


## Task 6b: Concatenation by Columns

Using the same dataframes, df1 and df2, from Task 6a practice:
+ Concatenate the two by columns
+ Add a "delta" column to `df1` and concatenate by columns such that there are 5 columns in the merged dataframe.
+ Respond in writing to this question (add a new 'raw' cell to contain your answer). What would happen if you performed an inner join while concatenating?
+ Try the concatenation with the inner join to see if you are correct.

In [81]:
df4 = pd.concat([df1, df2], axis = 1)
df4

,alpha,beta,delta,gamma
I1,0.0,a,,
I2,1.0,b,,
I3,2.0,c,,
I4,3.0,d,0.0,e
I5,4.0,e,3.0,i
I6,,,5.0,a
I7,,,7.0,s
I8,,,9.0,p


In [82]:
df1['delta'] = ['0', '3', '5', '7', '9']
df1

,alpha,beta,delta
I1,0,a,0
I2,1,b,3
I3,2,c,5
I4,3,d,7
I5,4,e,9


In [83]:
df5 = pd.concat([df1, df2], axis = 1)
df5

,alpha,beta,delta,delta,gamma
I1,0.0,a,0.0,,
I2,1.0,b,3.0,,
I3,2.0,c,5.0,,
I4,3.0,d,7.0,0.0,e
I5,4.0,e,9.0,3.0,i
I6,,,,5.0,a
I7,,,,7.0,s
I8,,,,9.0,p


In [84]:
df5 = pd.concat([df1, df2], axis = 1, join = 'inner')
df5

,alpha,beta,delta,delta,gamma
I4,3,d,7,0,e
I5,4,e,9,3,i
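
One way to check the written answer is to compare the inner-join result with the intersection of the two indexes. A short sketch, where `common` and `inner` are hypothetical names:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'alpha': [0, 1, 2, 3, 4],
     'beta': ['a', 'b', 'c', 'd', 'e'],
     'delta': ['0', '3', '5', '7', '9']},
    index=['I1', 'I2', 'I3', 'I4', 'I5'])
df2 = pd.DataFrame(
    {'delta': [0, 3, 5, 7, 9],
     'gamma': ['e', 'i', 'a', 's', 'p']},
    index=['I4', 'I5', 'I6', 'I7', 'I8'])

# row labels that appear in both frames
common = df1.index.intersection(df2.index)

# with axis=1, join='inner' keeps only those shared row labels
inner = pd.concat([df1, df2], axis=1, join='inner')

print(list(common))
print(inner)
```

Since only shared rows survive, the inner join contains no missing values, while both `delta` columns are kept side by side.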


## Task 6c: Concat and append data frames
<span style="float:right; margin-left:10px; clear:both;">![Task](./media/task-icon.png)</span>

+ Create a new 5x5 dataframe full of random numbers.
+ Create a new 5x10 dataframe full of 1's.
+ Append one to the other and print it.
+ Append a single Series of zeros to the end of the appended dataframe.


In [90]:
a = pd.DataFrame(np.random.random([5,5]))
a

,0,1,2,3,4
0,0.158922,0.868379,0.963972,0.52086,0.954869
1,0.234009,0.989831,0.328288,0.252992,0.649879
2,0.083962,0.926441,0.481029,0.558153,0.803322
3,0.638625,0.385074,0.315166,0.071627,0.192696
4,0.398121,0.8317,0.898634,0.172149,0.044637


In [91]:
b = pd.DataFrame(np.ones([5,10]))
b

,0,1,2,3,4,5,6,7,8,9
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [93]:
c = a.append(b, ignore_index = True)
c

,0,1,2,3,4,5,6,7,8,9
0,0.158922,0.868379,0.963972,0.52086,0.954869,,,,,
1,0.234009,0.989831,0.328288,0.252992,0.649879,,,,,
2,0.083962,0.926441,0.481029,0.558153,0.803322,,,,,
3,0.638625,0.385074,0.315166,0.071627,0.192696,,,,,
4,0.398121,0.8317,0.898634,0.172149,0.044637,,,,,
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [99]:
d = pd.Series(np.random.random(10))
d

0    0.957939
1    0.561682
2    0.378702
3    0.636695
4    0.239419
5    0.921099
6    0.620600
7    0.230159
8    0.282215
9    0.769106
dtype: float64

In [100]:
c.append(d, ignore_index = True)

,0,1,2,3,4,5,6,7,8,9
0,0.158922,0.868379,0.963972,0.52086,0.954869,,,,,
1,0.234009,0.989831,0.328288,0.252992,0.649879,,,,,
2,0.083962,0.926441,0.481029,0.558153,0.803322,,,,,
3,0.638625,0.385074,0.315166,0.071627,0.192696,,,,,
4,0.398121,0.8317,0.898634,0.172149,0.044637,,,,,
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
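
`DataFrame.append` was removed in pandas 2.0, so the cells above only run on older versions. The sketch below shows the same steps with `pd.concat`, and uses a Series of zeros as the task describes. The variable names and the seed are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.DataFrame(rng.random((5, 5)))   # 5x5 frame of random numbers
b = pd.DataFrame(np.ones((5, 10)))     # 5x10 frame of ones

# stack b under a; columns 5..9 do not exist in a, so they become NaN there
c = pd.concat([a, b], ignore_index=True)

# a Series becomes a single row once it is turned into a frame and transposed
zeros = pd.Series(np.zeros(10))
result = pd.concat([c, zeros.to_frame().T], ignore_index=True)

print(result)
```

Transposing the one-column frame lines the Series values up with columns `0` through `9`, so the zeros land in a new final row.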


## Task 6d: Grouping

Demonstrate a `groupby`.

+ Create a new column with the label "region" in the iris data frame. This column indicates the geographic region of the US where measurements were taken. Values should include: 'Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'. Assign these randomly to the rows.
+ Use `groupby` to get a new data frame of means for each species in each region.
+ Add a `dev_stage` column by randomly selecting from the values "early" and "late".
+ Use `groupby` to get a new data frame of means for each species, in each region and each development stage.
+ Use the `count` function (just like you used the `mean` function) to identify how many rows in the table belong to each combination of species + region + developmental stage.

In [117]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'])
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Southwest
1,4.9,3.0,1.4,0.2,setosa,Southwest
2,4.7,3.2,1.3,0.2,setosa,Southwest
3,4.6,3.1,1.5,0.2,setosa,Southwest
4,5.0,3.6,1.4,0.2,setosa,Southwest
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,Southwest
146,6.3,2.5,5.0,1.9,virginica,Southwest
147,6.5,3.0,5.2,2.0,virginica,Southwest
148,6.2,3.4,5.4,2.3,virginica,Southwest


In [118]:
groups = iris_df.groupby('region')
groups.mean()

region,sepal_length,sepal_width,petal_length,petal_width
Southwest,5.843333,3.054,3.758667,1.198667


In [119]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'])
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Southeast
1,4.9,3.0,1.4,0.2,setosa,Southeast
2,4.7,3.2,1.3,0.2,setosa,Southeast
3,4.6,3.1,1.5,0.2,setosa,Southeast
4,5.0,3.6,1.4,0.2,setosa,Southeast
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,Southeast
146,6.3,2.5,5.0,1.9,virginica,Southeast
147,6.5,3.0,5.2,2.0,virginica,Southeast
148,6.2,3.4,5.4,2.3,virginica,Southeast


In [120]:
groups = iris_df.groupby('region')
groups.mean()

region,sepal_length,sepal_width,petal_length,petal_width
Southeast,5.843333,3.054,3.758667,1.198667


In [121]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'])
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Northwest
1,4.9,3.0,1.4,0.2,setosa,Northwest
2,4.7,3.2,1.3,0.2,setosa,Northwest
3,4.6,3.1,1.5,0.2,setosa,Northwest
4,5.0,3.6,1.4,0.2,setosa,Northwest
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,Northwest
146,6.3,2.5,5.0,1.9,virginica,Northwest
147,6.5,3.0,5.2,2.0,virginica,Northwest
148,6.2,3.4,5.4,2.3,virginica,Northwest


In [122]:
groups = iris_df.groupby('region')
groups.mean()

region,sepal_length,sepal_width,petal_length,petal_width
Northwest,5.843333,3.054,3.758667,1.198667


In [125]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'])
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Midwest
1,4.9,3.0,1.4,0.2,setosa,Midwest
2,4.7,3.2,1.3,0.2,setosa,Midwest
3,4.6,3.1,1.5,0.2,setosa,Midwest
4,5.0,3.6,1.4,0.2,setosa,Midwest
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,Midwest
146,6.3,2.5,5.0,1.9,virginica,Midwest
147,6.5,3.0,5.2,2.0,virginica,Midwest
148,6.2,3.4,5.4,2.3,virginica,Midwest


In [126]:
groups = iris_df.groupby('region')
groups.mean()

region,sepal_length,sepal_width,petal_length,petal_width
Midwest,5.843333,3.054,3.758667,1.198667


In [131]:
iris_df['region'] = np.random.choice(['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest'])
iris_df

,sepal_length,sepal_width,petal_length,petal_width,species,region
0,5.1,3.5,1.4,0.2,setosa,Northeast
1,4.9,3.0,1.4,0.2,setosa,Northeast
2,4.7,3.2,1.3,0.2,setosa,Northeast
3,4.6,3.1,1.5,0.2,setosa,Northeast
4,5.0,3.6,1.4,0.2,setosa,Northeast
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,Northeast
146,6.3,2.5,5.0,1.9,virginica,Northeast
147,6.5,3.0,5.2,2.0,virginica,Northeast
148,6.2,3.4,5.4,2.3,virginica,Northeast


In [132]:
groups = iris_df.groupby('region')
groups.mean()

region,sepal_length,sepal_width,petal_length,petal_width
Northeast,5.843333,3.054,3.758667,1.198667


In [136]:
iris_df['dev_stage'] = np.random.choice(['early', 'late'], iris_df.shape[0])
iris_df.head()

,sepal_length,sepal_width,petal_length,petal_width,species,region,dev_stage
0,5.1,3.5,1.4,0.2,setosa,Northeast,early
1,4.9,3.0,1.4,0.2,setosa,Northeast,early
2,4.7,3.2,1.3,0.2,setosa,Northeast,late
3,4.6,3.1,1.5,0.2,setosa,Northeast,late
4,5.0,3.6,1.4,0.2,setosa,Northeast,early


In [139]:
groups = iris_df.groupby(['species', 'region', 'dev_stage'])
groups.mean()

species,region,dev_stage,sepal_length,sepal_width,petal_length,petal_width
setosa,Northeast,early,16,16,16,16
setosa,Northeast,late,34,34,34,34
versicolor,Northeast,early,24,24,24,24
versicolor,Northeast,late,26,26,26,26
virginica,Northeast,early,27,27,27,27
virginica,Northeast,late,23,23,23,23


In [142]:
groups.count()

species,region,dev_stage,sepal_length,sepal_width,petal_length,petal_width
setosa,Northeast,early,16,16,16,16
setosa,Northeast,late,34,34,34,34
versicolor,Northeast,early,24,24,24,24
versicolor,Northeast,late,26,26,26,26
virginica,Northeast,early,27,27,27,27
virginica,Northeast,late,23,23,23,23
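
Calling `np.random.choice` without a size returns a single value, which pandas then broadcasts to every row, so every row ends up in the same region. A sketch of per-row random assignment followed by the grouped mean and count. The 30-row `frame` is a hypothetical stand-in for `iris_df` with made-up measurements, and the seed is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for iris_df: 30 synthetic rows, three species
frame = pd.DataFrame({
    'sepal_length': rng.normal(5.8, 0.8, 30),
    'petal_length': rng.normal(3.8, 1.7, 30),
    'species': np.repeat(['setosa', 'versicolor', 'virginica'], 10)})

regions = ['Southeast', 'Northeast', 'Midwest', 'Southwest', 'Northwest']

# size=len(frame) draws one independent value per row
frame['region'] = rng.choice(regions, size=len(frame))
frame['dev_stage'] = rng.choice(['early', 'late'], size=len(frame))

keys = ['species', 'region', 'dev_stage']
means = frame.groupby(keys).mean()    # mean of each measurement per group
counts = frame.groupby(keys).count()  # number of non-missing rows per group

print(means)
print(counts)
```

The counts across all groups add up to the number of rows in the frame, which is a quick check that every row was assigned to exactly one group.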
