# Lesson 3 Practice: Pandas Part 1

Use this notebook to follow along with the lesson in the corresponding lesson notebook: [L03-Pandas_Part1-Lesson.ipynb](./L03-Pandas_Part1-Lesson.ipynb).  

## Instructions
Follow along with the teaching material in the lesson. Sections labeled "Tasks" are interspersed throughout the tutorial and marked with the icon: ![Task](http://icons.iconarchive.com/icons/sbstnblnd/plateau/16/Apps-gnome-info-icon.png). Carry out the instructions in these sections in this practice notebook. When you have completed the tutorial, you can turn in the final practice notebook. For each task, use the cell below it to write and test your code. You may add cells for any task as needed.

## Task 1a: Setup

Import the following packages:

+ `numpy` as `np`
+ `pandas` as `pd`


In [1]:
import numpy as np
import pandas as pd

## Task 2a: Create a `pd.Series` object

+ Create a series of your own design.

In [2]:
my_series = pd.Series([1.0,2,3,4,5,6])
my_series2 = pd.Series([1,2,3,4,5,6])
my_series3 = pd.Series(['a','b','c'])
my_series4 = pd.Series(["a","b","c"])

print(my_series)
print(my_series2)
print(my_series3)
print(my_series4)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
0    a
1    b
2    c
dtype: object
0    a
1    b
2    c
dtype: object
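The outputs above show pandas inferring a dtype from the values. As a small sketch of two related points (the labels, values, and variable names here are hypothetical, chosen only for illustration), a Series can also carry its own index labels, and a single float in a list of integers is enough to make the whole Series `float64`:

```python
import pandas as pd

# Hypothetical labels and values, used only for illustration.
scores = pd.Series([90, 85, 77], index=["alice", "bob", "carol"], name="score")

# One float among integers upcasts the whole Series to float64.
mixed = pd.Series([1.0, 2, 3])

print(scores["bob"])  # label-based lookup
print(mixed.dtype)
```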


## Task 2b: Creating a DataFrame

+ Create a `pd.DataFrame` object from a Python dictionary. Design the data as you like.

In [3]:
# class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
# index - row labels
# columns - column labels
# dtype - a single dtype applied to all columns if given; otherwise inferred per column

df = pd.DataFrame({"one": ["a", "b", "c"], "two": [1, 2, 3]})
df

| | one | two |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |


In [4]:
# 'dtypes' gives the info about the data type of each column
df.dtypes

one    object
two     int64
dtype: object

In [5]:
# 'shape' gives the info about the shape of the data frame - (rows,columns)
df.shape

(3, 2)

In [6]:
# df.isna()
# returns an object of the same shape with boolean values: True for NA values such as None or numpy.nan

# df.duplicated
# df.duplicated(subset=None, keep='first')
# returns a boolean Series marking duplicate ROWS
# subset - check for duplicate rows using only the specified columns
# keep='first' -> marks the first occurrence of a duplicate as False and all later occurrences as True

# df.nunique
# df.nunique(axis=0, dropna=True)
# returns a Series with the number of distinct elements
# axis - axis to count along; the default (0, the index) gives one count per column
# dropna - if True, NaN is not counted
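The notes above can be tried on a tiny frame. This is a minimal sketch; the `demo` frame and its contents are made up for illustration and are not part of the lesson data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 1, and row 3 is entirely missing.
demo = pd.DataFrame({
    "one": ["a", "b", "b", None],
    "two": [1.0, 2.0, 2.0, np.nan],
})

na_mask = demo.isna()         # same shape as demo, True where a value is missing
dup_rows = demo.duplicated()  # True for rows that repeat an earlier row
distinct = demo.nunique()     # distinct non-NaN values per column

print(na_mask)
print(dup_rows)
print(distinct)
```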



## Task 2c: Create DataFrame with labels

+ Create a 10x5 dataframe of random numeric values that follow a [Gaussian (normal) distribution](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.normal.html). 
  + Center the distribution at 0.85.
  + We will use these values as assumed grades for a class of students.
+ Adjust the row indexes to be the names of hypothetical students.
+ Adjust the columns to be the names of hypothetical projects, homework, exam names, etc.

In [7]:
grades = pd.DataFrame(
    np.random.normal(loc=0.85, scale=0.5, size=(10, 5)),
    index=['Sai', 'Tyler', 'John', 'Matt', 'Stephen',
           'Josh', 'Joel', 'Connor', 'Rish', 'Huiting'],
    columns=['Homework', 'Project', 'Midterm1', 'Midterm2', 'Final'],
)
grades

| | Homework | Project | Midterm1 | Midterm2 | Final |
|---|---|---|---|---|---|
| Sai | -0.141945 | 0.389148 | 0.053619 | 1.556097 | 1.360164 |
| Tyler | 1.171792 | 0.957454 | 1.063754 | 0.597786 | 0.632602 |
| John | 1.148595 | 0.504712 | 1.607621 | 1.099376 | -0.297248 |
| Matt | 1.315958 | 1.84857 | 1.001993 | -0.601429 | 0.324123 |
| Stephen | 0.529059 | 1.070142 | 1.428874 | 1.496263 | 0.797424 |
| Josh | 0.16781 | 1.722139 | 0.283003 | 0.148784 | 1.100305 |
| Joel | 0.095 | 0.344599 | 1.151098 | 0.610293 | 0.42197 |
| Connor | 1.068511 | 0.739521 | 0.239123 | 1.526574 | 1.195398 |
| Rish | 2.084037 | 0.700169 | 0.967318 | 0.67074 | 0.661606 |
| Huiting | 0.233114 | 1.436073 | 1.083152 | 0.718161 | 0.840944 |


In [8]:
grades.dtypes

Homework    float64
Project     float64
Midterm1    float64
Midterm2    float64
Final       float64
dtype: object
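Because the values are random, the table differs on every run, and a wide spread can produce grades below 0 or above 1. One possible variation, offered as a sketch rather than the expected answer, uses a seeded generator and clips the result. The seed, the narrower `scale=0.05`, the placeholder student names, and the name `grades_demo` are all assumptions made for this example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, so reruns give the same table
students = [f"student_{i}" for i in range(1, 11)]  # placeholder names
columns = ["Homework", "Project", "Midterm1", "Midterm2", "Final"]

# Draw 10x5 values centered at 0.85; the small scale is an assumption.
raw = rng.normal(loc=0.85, scale=0.05, size=(10, 5))

# Clip so every grade stays within the range 0 to 1.
grades_demo = pd.DataFrame(raw, index=students, columns=columns).clip(0, 1)
print(grades_demo.round(3))
```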

## Task 3a: Import the iris.csv file

+ Import the iris dataset.
+ Take a look at the `pd.read_csv` online documentation. Write example code in a Markdown cell for how you would import this file if it were tab-delimited.

In [9]:
iris_df = pd.read_csv('data/iris.csv')

To read a tab-delimited file, pass a tab character as the delimiter:
`pd.read_csv('file.txt', delimiter='\t')`
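The tab-delimited call can be checked without a file on disk by reading from an in-memory string. This is a minimal sketch; the sample text and the name `tsv_df` are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk.
tsv_text = (
    "sepal_length\tsepal_width\tspecies\n"
    "5.1\t3.5\tsetosa\n"
    "4.9\t3.0\tsetosa\n"
)

# delimiter is an alias for sep; either keyword accepts "\t".
tsv_df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")
print(tsv_df)
```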

## Task 4a: Explore Data

 + Use `head`, `tail` and `sample` with the iris dataset.
 + Do the same with the dataset you created in task 2c.

In [10]:
# Only the last expression in a cell is displayed, so only describe() appears below
grades.head()
grades.tail()
grades.sample(2)
grades.describe()

| | Homework | Project | Midterm1 | Midterm2 | Final |
|---|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 0.767193 | 0.971253 | 0.887956 | 0.782264 | 0.703729 |
| std | 0.700378 | 0.541185 | 0.52187 | 0.679806 | 0.481872 |
| min | -0.141945 | 0.344599 | 0.053619 | -0.601429 | -0.297248 |
| 25% | 0.184136 | 0.553576 | 0.454082 | 0.600913 | 0.474628 |
| 50% | 0.798785 | 0.848488 | 1.032873 | 0.69445 | 0.729515 |
| 75% | 1.165992 | 1.34459 | 1.134111 | 1.397041 | 1.035465 |
| max | 2.084037 | 1.84857 | 1.607621 | 1.556097 | 1.360164 |
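A notebook cell renders only its last expression, so to see several views from one cell each result has to be printed or displayed explicitly. A small sketch on a made-up frame (`small` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical frame with ten rows, used only for illustration.
small = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

first_three = small.head(3)                   # first 3 rows
last_three = small.tail(3)                    # last 3 rows
random_two = small.sample(2, random_state=0)  # 2 random rows, repeatable via random_state

# Printing each result makes all three appear in the cell output.
print(first_three)
print(last_three)
print(random_two)
```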


## Task 5a: Viewing columns and rows

+ Display the columns and indexes of the iris dataset.
+ Do the same with the dataset you created in Task 2c.

In [11]:
# Only the last expression (grades.index) is displayed below
iris_df.columns
iris_df.index
grades.columns
grades.index

Index(['Sai', 'Tyler', 'John', 'Matt', 'Stephen', 'Josh', 'Joel', 'Connor',
       'Rish', 'Huiting'],
      dtype='object')

## Task 5b: Get Values

+ Check the version of `pandas` you have 
+ Use the appropriate method to convert the iris data to a dictionary.
+ Do the same with the dataset you created in Task 2c.


In [12]:
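pd.__version__
# `.values` returns the data as a NumPy array; `.to_dict()` would return a dictionary instead
iris_values = iris_df.values
grades_values = grades.values
grades_values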

array([[-0.14194535,  0.38914824,  0.0536195 ,  1.55609745,  1.36016401],
       [ 1.17179177,  0.95745447,  1.063754  ,  0.59778575,  0.63260164],
       [ 1.14859468,  0.50471217,  1.60762079,  1.09937588, -0.29724808],
       [ 1.31595838,  1.84857028,  1.00199288, -0.60142874,  0.32412263],
       [ 0.52905858,  1.07014188,  1.42887438,  1.49626327,  0.79742443],
       [ 0.16781035,  1.72213946,  0.28300287,  0.1487835 ,  1.1003048 ],
       [ 0.09499994,  0.34459885,  1.15109764,  0.6102929 ,  0.42196972],
       [ 1.06851089,  0.73952149,  0.2391229 ,  1.52657368,  1.19539827],
       [ 2.08403677,  0.70016888,  0.96731823,  0.6707397 ,  0.66160609],
       [ 0.2331138 ,  1.43607323,  1.08315191,  0.71816122,  0.84094413]])
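The output above is a NumPy array. For comparison, here is a short sketch of the array and dictionary conversions side by side, on a made-up frame (`df_small` is hypothetical):

```python
import pandas as pd

df_small = pd.DataFrame({"one": ["a", "b"], "two": [1, 2]})

as_array = df_small.to_numpy()                   # 2-D NumPy array of the values
as_dict = df_small.to_dict()                     # {column: {index label: value}}
as_records = df_small.to_dict(orient="records")  # one dict per row

print(as_array)
print(as_dict)
print(as_records)
```

The `orient` argument controls the shape of the resulting dictionary; `"records"` is often convenient when each row should become its own dict.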

In [13]:
print(grades.columns)
grades['Homework']
grades[0:5]

Index(['Homework', 'Project', 'Midterm1', 'Midterm2', 'Final'], dtype='object')


| | Homework | Project | Midterm1 | Midterm2 | Final |
|---|---|---|---|---|---|
| Sai | -0.141945 | 0.389148 | 0.053619 | 1.556097 | 1.360164 |
| Tyler | 1.171792 | 0.957454 | 1.063754 | 0.597786 | 0.632602 |
| John | 1.148595 | 0.504712 | 1.607621 | 1.099376 | -0.297248 |
| Matt | 1.315958 | 1.84857 | 1.001993 | -0.601429 | 0.324123 |
| Stephen | 0.529059 | 1.070142 | 1.428874 | 1.496263 | 0.797424 |


## Task 5c: Using `loc`

+ Use any iris dataframe to:
  + Select a row slice with `loc`.
  + Select a row and column slice with `loc`.
  + Take a look at the [Pandas documentation for the `at` selector](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html). Use what you learn there to select a single item with the pandas `at` accessor.

In [14]:
grades.loc['Sai':'John']
grades.loc[:,'Midterm1':'Final']
grades.loc['John':'Rish','Homework':'Midterm2']
grades.loc[['John','Rish'],['Homework','Midterm2']]

# both of the below give the same output
grades.at['Sai','Homework']
grades.loc['Sai'].at['Homework']

-0.14194534824750304
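One detail worth checking with `loc` is that label slices include both endpoints. A minimal sketch on a made-up frame (`marks` and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical frame with string row labels.
marks = pd.DataFrame(
    {"Homework": [0.9, 0.8, 0.7], "Final": [0.6, 0.5, 0.4]},
    index=["Ann", "Ben", "Cat"],
)

rows = marks.loc["Ann":"Ben"]    # label slice: both "Ann" and "Ben" are included
cell = marks.at["Cat", "Final"]  # single value by row label and column label

print(rows)
print(cell)
```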

## Task 5d: Using `iloc`

+ Use any iris dataframe to:    
    + Select a row slice with `iloc`.
    + Select a row and column slice with `iloc`.
    + Take a look at the [Pandas documentation for the `iat` selector](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html). Use what you learn there to select a single item with the pandas `iat` accessor.



In [15]:
iris_df.head()
iris_df.iloc[2:5]
iris_df.iloc[2:5,0:4]

# both of the below give the same output
iris_df.iat[2,0]
iris_df.iloc[2].iat[0]

4.7
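In contrast to label slices, `iloc` slices follow Python's positional convention and exclude the stop position. A minimal sketch on a made-up frame (`marks` and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical frame with string row labels.
marks = pd.DataFrame(
    {"Homework": [0.9, 0.8, 0.7], "Final": [0.6, 0.5, 0.4]},
    index=["Ann", "Ben", "Cat"],
)

rows = marks.iloc[0:2]   # positions 0 and 1 only; position 2 is excluded
cell = marks.iat[2, 1]   # single value at row position 2, column position 1

print(rows)
print(cell)
```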

## Task 5e: Boolean Indexing

+ Create subsets of the iris dataset using boolean indexes that:
    + Use one boolean operator.
    + Use two boolean operators.



In [16]:
iris_df.columns
iris_df.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [17]:
iris_df[iris_df['sepal_length'] > 5.8].sample()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


In [18]:
iris_df[iris_df['petal_length'] > 3.75].sample()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 59 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |


In [19]:
# remember to use the bitwise and '&' or the bitwise or '|', and to wrap each condition in parentheses

iris_df[(iris_df['sepal_length'] > 5.8) & (iris_df['petal_length'] > 3.75)].sample()
iris_df[(iris_df['sepal_length'] > 5.8) | (iris_df['petal_length'] > 3.75)].sample()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
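Because `.sample()` returns a random row, the effect of each operator is easier to see on a tiny frame where the full result is visible. This is a sketch with made-up measurements (`flowers` and its values are hypothetical); it reuses the thresholds from the cells above and adds `~` for negation:

```python
import pandas as pd

# Hypothetical measurements, not taken from the iris dataset.
flowers = pd.DataFrame({
    "sepal_length": [6.0, 6.3, 5.9, 4.7, 5.0],
    "petal_length": [1.4, 5.6, 5.1, 4.0, 1.3],
})

# Naming each mask keeps the combined conditions readable.
long_sepal = flowers["sepal_length"] > 5.8
long_petal = flowers["petal_length"] > 3.75

both = flowers[long_sepal & long_petal]        # rows meeting both conditions
either = flowers[long_sepal | long_petal]      # rows meeting at least one
neither = flowers[~(long_sepal | long_petal)]  # rows meeting neither

print(both)
print(either)
print(neither)
```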
