# Introduction to Pandas. Part III

In [None]:
import pandas as pd

Table of Contents:

- [Handling missing values in pandas](#1.-Handling-missing-values-in-pandas)
- [Working with dates and times in pandas](#2.-Working-with-dates-and-times-in-pandas)
- [Using string methods in pandas](#3.-Using-string-methods-in-pandas)
- [Creating dummy variables in pandas](#4.-Creating-dummy-variables-in-pandas)

## 1. Handling missing values in pandas

- [Dropping rows/columns with missing values](#1.1.-Dropping-rows/columns-with-missing-values)
- [Filling in missing values](#1.2.-Filling-in-missing-values)

In [None]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/ufo.csv'
ufo = pd.read_csv(url)
ufo.tail()

What does "NaN" mean?

- "NaN" is not a string, rather it's a special value: numpy.nan.
- It stands for "Not a Number" and indicates a **missing value**.
- read_csv detects missing values (by default) when reading the file, and replaces them with this special value.
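As a minimal sketch of the points above, the cell below reads a small, made-up CSV from memory (the `csv_text` contents are hypothetical, not taken from the UFO file). It assumes the default `read_csv` settings, under which empty fields and the token `NA` are treated as missing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row CSV: one empty 'City' field and one 'NA' shape
csv_text = (
    "City,Shape Reported,State\n"
    "Ithaca,TRIANGLE,NY\n"
    ",OTHER,NJ\n"
    "Holyoke,NA,CO\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)           # the missing entries are displayed as NaN
print(sample.isnull())  # True marks the positions that were detected as missing

# NaN is a float, and it does not compare equal to anything, not even itself,
# which is why 'isnull' is used instead of '== np.nan'
print(type(np.nan))
print(np.nan == np.nan)
```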

In [None]:
# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull()

In [None]:
# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull()

Documentation for [isnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) and [notnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html)

In [None]:
# count the number of missing values in each Series
ufo.isnull().sum()

This calculation works because:

- The sum method for a DataFrame operates on axis=0 by default (and thus produces column sums).
- In order to add boolean values, pandas converts True to 1 and False to 0.
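The two points above can be checked on a toy DataFrame (the data here is made up for illustration). The sketch also shows `axis=1` for row-wise counts and `mean` for the fraction of missing values, which relies on the same True-to-1 conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
toy = pd.DataFrame({
    'City': ['Ithaca', np.nan, 'Holyoke'],
    'Colors Reported': [np.nan, np.nan, 'RED'],
})

mask = toy.isnull()        # DataFrame of booleans

print(mask.sum())          # missing values per column (axis=0 is the default)
print(mask.sum(axis=1))    # missing values per row
print(mask.mean())         # fraction of missing values per column
```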

In [None]:
# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()] # return rows with missing 'City'

In [None]:
ufo[ufo['Shape Reported'].isnull()] # returns rows with missing 'Shape Reported'

In [None]:
ufo[ufo['Colors Reported'].notnull()].head() # returns rows such that 'Colors Reported' is not missing

**How to handle missing values** depends on the dataset as well as the nature of your analysis. Here are some options:

### 1.1. Dropping rows/columns with missing values

Documentation for [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). The 'inplace' parameter of 'dropna' is False by default, so each call returns a new DataFrame and leaves the original DataFrame unchanged.
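A small sketch of what "not in place" means in practice, using a made-up two-row DataFrame: the result of `dropna` has to be assigned to a name if you want to keep it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing 'City'
toy = pd.DataFrame({'City': ['Ithaca', np.nan], 'State': ['NY', 'NJ']})

dropped = toy.dropna()   # returns a new DataFrame without the incomplete row

print(len(toy))          # the original still has both rows
print(len(dropped))      # the result has only the complete row

# to make the change permanent, assign the result back
toy = toy.dropna()
```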

In [None]:
# if 'all' values are missing in a column, then drop that column (none are dropped in this case)
ufo.dropna(axis=1,how='all')

In [None]:
# if 'any' values are missing in a column, then drop that column
ufo.dropna(axis=1,how='any')

In [None]:
# drop a column only if more than 75% of its values are missing
pct_missing = ufo.isnull().sum()/len(ufo)
pct_missing

In [None]:
cols_todrop = pct_missing[pct_missing>0.75].index
cols_todrop

In [None]:
ufo.drop(cols_todrop,axis=1)
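The same 75% rule can also be expressed with the `thresh` parameter of `dropna`, which is the minimum number of non-missing values a column needs in order to be kept. The sketch below uses a made-up 8-row DataFrame and compares both approaches. The rounding with `math.ceil` is an assumption chosen so that a column with exactly 75% missing values is kept, matching the strict `>` used above.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical data: 'A' is 87.5% missing, 'B' is exactly 75% missing, 'C' is complete
toy = pd.DataFrame({
    'A': [1.0] + [np.nan] * 7,
    'B': [1.0, 2.0] + [np.nan] * 6,
    'C': list(range(8)),
})

# manual approach: compute the fraction missing, then drop by column label
pct_missing = toy.isnull().sum() / len(toy)
manual = toy.drop(pct_missing[pct_missing > 0.75].index, axis=1)

# 'thresh' approach: keep columns with at least 25% non-missing values
min_present = math.ceil(0.25 * len(toy))
via_thresh = toy.dropna(axis=1, thresh=min_present)

print(list(manual.columns))
print(list(via_thresh.columns))
```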

In [None]:
# if 'all' values are missing in a row, then drop that row 
ufo.dropna(how='all')

In [None]:
# if 'any' values are missing in a row, then drop that row
ufo.dropna(how='any')

In [None]:
# if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='any')

In [None]:
# if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row
ufo.dropna(subset=['City', 'Shape Reported'], how='all')

### 1.2. Filling in missing values

Documentation for [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

In [None]:
# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts()

In [None]:
# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False)

In [None]:
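# fill in missing values with a specified value (assign the result back to the column,
# since calling 'fillna' with inplace=True on a selected column may not update 'ufo' in recent pandas versions)
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='VARIOUS')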

In [None]:
# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts(dropna=False)

## 2. Working with dates and times in pandas

In [None]:
ufo.head()

In [None]:
# 'Time' is currently stored as a string
ufo.dtypes

In [None]:
ufo.Time[0] # returns a string

In [None]:
# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()

In [None]:
ufo.Time[0] # returns a Timestamp

Convenient datetime attributes are now available through the Series' .dt accessor:

In [None]:
ufo.Time

In [None]:
ufo.Time.dt.year

In [None]:
ufo.Time.dt.weekday

In [None]:
ufo.Time.dt.hour

In [None]:
ufo.Time.dt.dayofyear
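The `.dt` accessor used above can be tried on a tiny Series that does not require downloading the dataset. The two timestamps below are made up for illustration.

```python
import pandas as pd

# Hypothetical sighting times, converted from strings to datetime64
times = pd.Series(pd.to_datetime(['1930-06-01 22:00', '1999-12-31 23:59']))

print(times.dt.year)        # calendar year of each timestamp
print(times.dt.hour)        # hour of the day (0-23)
print(times.dt.weekday)     # Monday=0, ..., Sunday=6
print(times.dt.day_name())  # weekday as a string
```

Note that `.dt` is only available on datetime-like Series. On a Series of plain strings it raises an AttributeError, which is why the conversion with `pd.to_datetime` comes first.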

**Trick 1:**  filter by date

In [None]:
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts

In [None]:
# compare a datetime Series with a timestamp
ufo.loc[ufo.Time >= ts, :]

**Trick 2:** perform mathematical operations with timestamps (outputs a timedelta object)

In [None]:
ufo.Time.max() # latest date

In [None]:
ufo.Time.min() # earliest date

In [None]:
ufo.Time.max()-ufo.Time.min() # time elapsed between the earliest and the latest timestamps

In [None]:
time = ufo.Time.max()-ufo.Time.min()

In [None]:
time.days
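A self-contained sketch of the timestamp arithmetic above, with two made-up timestamps standing in for the earliest and latest sightings. Subtracting two Timestamps gives a Timedelta, and adding a Timedelta to a Timestamp gives another Timestamp.

```python
import pandas as pd

# Hypothetical earliest and latest timestamps
first = pd.Timestamp('1930-06-01 22:00')
last = pd.Timestamp('2000-12-31 23:59')

span = last - first            # Timedelta object
print(span)
print(span.days)               # whole days in the span
print(span.total_seconds())    # the full span expressed in seconds

# shifting a timestamp by a fixed amount of time
later = first + pd.Timedelta(days=30)
print(later)
```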

**EXTRA:** plot the number of UFOs reported by year

In [None]:
ufo['year'] = ufo.Time.dt.year
ufo.head()

In [None]:
ufo.year.value_counts()

In [None]:
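# 'value_counts' sorts by frequency, so this plot is not in chronological order
ufo.year.value_counts().plot()

In [None]:
# sort by year first to get a chronological plot
ufo.year.value_counts().sort_index().plot(figsize=(12,5))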

**EXTRA:** Indexing by timestamps

Pandas' time series tools become really useful once you index the data by timestamps (more on this in Part IV).

In [None]:
ufo = ufo.set_index('Time')
ufo

With a DatetimeIndex, you don't need the .dt accessor: attributes such as year and hour are available directly on the index.

In [None]:
ufo.index.year

In [None]:
ufo.index.hour
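One benefit of a DatetimeIndex is that rows can be selected with partial date strings through `.loc`. The sketch below uses a made-up three-row DataFrame. Sorting the index first is an assumption that keeps slicing well defined.

```python
import pandas as pd

# Hypothetical sightings indexed by timestamp
toy = pd.DataFrame(
    {'City': ['Ithaca', 'Holyoke', 'Abilene']},
    index=pd.to_datetime(['1930-06-01 22:00', '1999-03-15 20:00', '1999-12-31 23:59']),
).sort_index()

in_1999 = toy.loc['1999']                # every row from the year 1999
spring = toy.loc['1999-03':'1999-06']    # rows from March through June 1999

print(in_1999)
print(spring)
```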

## 3. Using string methods in pandas

In [None]:
# read a dataset of Chipotle orders into a DataFrame
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/chipotleorders.csv'
orders = pd.read_csv(url)
orders.head()

In [None]:
# normal way to access string methods in Python
'hello'.upper()

In [None]:
'hello'.islower()

In [None]:
'hello'.isupper()

In [None]:
'hello, my name is Javier'.split(',')

In [None]:
'hello, my name is Javier'.split(' ')

In [None]:
'hello, my name is Javier'.replace('Javier','Bob')

String methods for pandas Series are accessed via 'str'

In [None]:
orders.item_name

In [None]:
orders.item_name.str.upper()

In [None]:
# string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken')

In [None]:
# use the boolean Series to filter the DataFrame
orders.loc[orders.item_name.str.contains('Chicken'),:] # rows that have chicken in the item_name
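One caveat when filtering with `str.contains`: if the column has missing values, the result may contain missing values too (this depends on the pandas version and the column dtype), and such a mask cannot be used for filtering. A minimal sketch with made-up item names, using the `na` and `case` parameters of `str.contains`:

```python
import numpy as np
import pandas as pd

# Hypothetical item names, one of them missing
items = pd.Series(['Chicken Bowl', np.nan, 'Steak Burrito'])

# treat missing values as "does not contain" so the mask is purely boolean
mask = items.str.contains('Chicken', na=False)
print(items[mask])

# case-insensitive match
mask_any_case = items.str.contains('chicken', case=False, na=False)
print(items[mask_any_case])
```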

In [None]:
# string methods can be chained together
orders.choice_description.str.replace('[', '', regex=False).str.replace(']', '', regex=False) # remove square brackets (regex=False treats '[' and ']' as literal characters)
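As an alternative sketch, both brackets can be removed in a single call by passing a regular expression with `regex=True`. The descriptions below are made up to resemble the 'choice_description' column.

```python
import pandas as pd

# Hypothetical descriptions with nested square brackets
desc = pd.Series(['[Fresh Tomato Salsa, [Rice, Black Beans]]', '[Clementine]'])

# chained literal replacements, one per character
chained = desc.str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# a single regex replacement: the character class matches either '[' or ']'
one_pass = desc.str.replace(r'[\[\]]', '', regex=True)

print(chained)
print(one_pass)
```

Setting `regex` explicitly is a defensive choice, because its default value has changed between pandas versions.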

## 4. Creating dummy variables in pandas

In [None]:
# read the training dataset from Kaggle's Titanic competition
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/titanic.csv'
titanic = pd.read_csv(url)
titanic.head()

In [None]:
titanic.Sex.value_counts()

In [None]:
# create the 'Sex_male' dummy variable using the 'map' method
titanic['Sex_male'] = titanic.Sex.map({'female':0,'male':1})
titanic.head()

In [None]:
# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(titanic.Sex)

Generally speaking:

- If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.
- One convention is to drop the first dummy variable, which defines that level as the "baseline".
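To see why "K-1" columns are enough, note that every row has exactly one dummy equal to 1, so any one column can be rebuilt from the others. A small sketch with made-up values (`dtype=int` is passed so the dummies are 0/1 integers regardless of the pandas version):

```python
import pandas as pd

# Hypothetical categorical feature with K=3 possible values
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')
full = pd.get_dummies(embarked, prefix='Embarked', dtype=int)
print(full)

# each row sums to 1 across the K dummy columns
print(full.sum(axis=1))

# so the 'C' column is redundant: it equals 1 minus the sum of the other two
rebuilt_C = 1 - full[['Embarked_Q', 'Embarked_S']].sum(axis=1)
print(rebuilt_C)
```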

In [None]:
# drop the first dummy variable ('female') using the 'iloc' indexer
pd.get_dummies(titanic.Sex).iloc[:,1].head()

In [None]:
# add a prefix to identify the source of the dummy variables
pd.get_dummies(titanic.Sex, prefix='Sex').head()

In [None]:
# use 'get_dummies' with a feature that has 3 possible values
titanic.Embarked.value_counts()

In [None]:
pd.get_dummies(titanic.Embarked, prefix='Embarked').head(10)

In [None]:
# drop the first dummy variable ('C')
pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:].head(10)

How to translate these values back to the original 'Embarked' value:

- 0, 0 means C
- 1, 0 means Q
- 0, 1 means S
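The translation rule above can be written as code. This is only an illustrative sketch on a made-up DataFrame of dummies. It uses `numpy.select`, which is one of several ways to do this, with 'C' as the baseline when both dummies are 0.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy columns after dropping 'Embarked_C'
dummies = pd.DataFrame({'Embarked_Q': [0, 1, 0], 'Embarked_S': [0, 0, 1]})

# pick 'Q' or 'S' when the matching dummy is 1, otherwise fall back to the baseline 'C'
decoded = np.select(
    [dummies['Embarked_Q'] == 1, dummies['Embarked_S'] == 1],
    ['Q', 'S'],
    default='C',
)
print(decoded)
```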

In [None]:
# save the DataFrame of dummy variables and concatenate them to the original DataFrame
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:,1:]
titanic = pd.concat([titanic,embarked_dummies], axis=1)
titanic.head()

In [None]:
# reset the DataFrame
titanic = pd.read_csv('http://bit.ly/kaggletrain')
titanic.head()

In [None]:
# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(titanic, columns=['Sex','Embarked']).head()

In [None]:
# use the 'drop_first' parameter to drop the first dummy variable for each feature
pd.get_dummies(titanic,columns=['Sex','Embarked'],drop_first=True).head()

In [None]:
titanic