# Introduction to Pandas I
Welcome to Pandas I. In this lesson we will cover:
- **Creating, reading and writing DataFrames**
- **Indexing in Pandas**
- **Mapping and Summarizing with Pandas**

The lab for Lesson 3 consists of the exercises that you will find throughout the notebook.

For this exercise we will be using the Titanic Survival Dataset from Kaggle. We will perform various transformations, edits and exploration.

These are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `?`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `?`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Home.Dest**: Home / Destination

Let's start with importing some of the libraries we will be using.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting interface lives in the pyplot submodule

### Why Pandas?
Pandas is the go-to Python library for data scientists and machine learning engineers. It allows for easy data access and easy data manipulation, and it's free! Let's start off by reading the Titanic survival dataset.

## Creating and Reading Data

### Reading in Data

In [None]:
# Read in the titanic survival dataset 
titanic_data = pd.read_csv('titanic_data.csv')
titanic_data

Pandas has various options to read data. If you are curious, you can read about them here: [Pandas Input/Output](https://pandas.pydata.org/pandas-docs/stable/reference/io.html). For the majority of the exercises, we will be using `read_csv`. If you would like to see all the options for read_csv, [here is the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv). 
A CSV file is a table of values separated by commas (CSV = Comma-Separated Values). Each line of the file looks like:

data1,data2,data3,...,datan

The data is read into a DataFrame. A DataFrame, as you can see from above, is a table. Every entry corresponds to a row and column.
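To make the link between the raw text and the resulting table concrete, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of a file. The three sample rows and the names `csv_text` and `sample_df` are made up for illustration; they are not taken from `titanic_data.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line is the header, each later line is one row
csv_text = """name,sex,age
Passenger A,female,29
Passenger B,male,40
Passenger C,female,2
"""

# read_csv accepts a file path or any file-like object, such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the header line
```

The header line becomes the column names, and every following line becomes one row of the DataFrame.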

### Pandas Series 
A Series is a one-dimensional sequence of values, each paired with an index label. What's the difference between a Series and a DataFrame? A DataFrame is a table, and a Series is a single column of that table. A DataFrame has a name for every column, whereas a Series has at most one name of its own.
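As a small sketch of the distinction, the hypothetical `people` table below shows that selecting with single brackets gives a Series, while selecting with a list of column names gives a one-column DataFrame. A Series taken from a DataFrame carries a single `name`, the label of the column it came from.

```python
import pandas as pd

# Hypothetical two-column table
people = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

ages_series = people["age"]    # single brackets -> Series
ages_frame = people[["age"]]   # list of column names -> one-column DataFrame

print(type(ages_series).__name__)  # Series
print(type(ages_frame).__name__)   # DataFrame
print(ages_series.name)            # age
```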

In [None]:
# Let's take a look at a Series within the titanic survival dataset.
titanic_series_data = titanic_data['survived']
titanic_series_data

In [None]:
# EXERCISE 1
# choose a column and create your own series below

### Creating Data Using Pandas
If you want to create your own dataset using pandas, you can use the `pd.DataFrame()` constructor. It can take in data in various formats and creates a DataFrame object that can be used to store data. In the example below we are creating a new DataFrame. Note the syntax required to define a header (`Survivor_#`) versus the data (a name).

In [None]:
my_data_frame = pd.DataFrame({"Survivor_1":["Nelson"],"Survivor_2":["Monica"],"Survivor_3":["Marlon"]})
my_data_frame

In [None]:
# Think: how would you add a second row to the data above?

In [None]:
# EXERCISE 2
# Create your own DataFrame. Name your DataFrame your_name_df, use your first and last name as the column names,
# and use the length of your first and last name as the values


## Selecting, Indexing and Assigning
Selecting, indexing, and assigning data will be your bread and butter of data analysis. These tools will allow you to select aspects of your data quickly and efficiently to analyze. 

### Native Accessing 
Let's start by selecting a column of data. There are two ways to do this, either by accessing the column using the dot operator or by referencing the column explicitly.

In [None]:
# Selecting one column data using the dot operator
titanic_data.ticket

In [None]:
# Selecting one column by referencing the column explicitly
titanic_data['ticket']

Neither option has a real advantage for simple column names; both return the same data. When the column is referenced explicitly, however, you can also use column names that contain special characters, such as "Test Data" or "First.Last".
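The sketch below illustrates this on a hypothetical table (`df` and its contents are made up). It also shows one more case where explicit referencing is the safer choice: a column whose name collides with an existing DataFrame method, such as `count`.

```python
import pandas as pd

# Hypothetical table with awkward column names
df = pd.DataFrame({
    "First.Last": ["Ann.Lee", "Bob.Ray"],
    "Test Data": [1, 2],
    "count": [10, 20],
})

# Bracket access works for any column name
full_names = df["First.Last"]
test_data = df["Test Data"]
count_column = df["count"]

# Dot access on "count" returns the DataFrame.count method, not the column
dot_count = df.count
print(callable(dot_count))  # True
```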

In [None]:
# EXERCISE 3
# Select the home.dest column from the titanic survival dataset and print it to the screen.


### Indexing in Pandas

Now let's take a look at a single value of a column, in this case the second one. Let's reference the column explicitly using square brackets:

In [None]:
titanic_data['ticket'][1]

We were able to access the second row *(zero-based indexing)* of data from the ticket column by referencing the column explicitly and using square brackets. The method typically used in industry, and the recommended way of accessing data by position, is `iloc`.

Let's use `iloc` below to access the same data as we did above:

In [None]:
# Using iloc by natively accessing a dataframe column.
titanic_data['ticket'].iloc[1]

We used `iloc` above to access the second value, `iloc[1]`, from the column ticket.

`iloc` accesses data by position, with the row first and the column second:

- iloc[row,column]

If we wanted to perform the operation above without natively accessing the column, we can use `iloc` in the following way:

In [None]:
# Using iloc to access the second row (position 1) of the eighth column (position 7)
titanic_data.iloc[1,7]

Now let's show more ways we can use `iloc`. 

In [None]:
# Using iloc, let's access all the rows in the first column
titanic_data.iloc[:,0]

In [None]:
# Using iloc to access the first three rows, from the first column
titanic_data.iloc[:3,0]

In [None]:
# Using iloc to access the last 5 rows from the first column
titanic_data.iloc[-5:,0]

`iloc` is not the only method that can be used to access data. The `loc` method can also be used similarly, but there is a difference between the two. `loc` is used to access data using label-based selection. So when we use `loc` we will include a label to use as shown below.
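The difference is easiest to see when the index labels do not coincide with the row positions. Here is a minimal sketch on a hypothetical table (`ages` and its values are made up) that also shows how slices behave: an `iloc` slice excludes its end position, while a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical table whose index labels are not 0, 1, 2, ...
ages = pd.DataFrame(
    {"age": [29, 2, 30, 25]},
    index=[10, 20, 30, 40],
)

# iloc is positional: positions 0 and 1; the end of the slice is excluded
by_position = ages.iloc[0:2, 0]

# loc is label-based: labels 10 through 30; the end of the slice is included
by_label = ages.loc[10:30, "age"]

print(by_position.tolist())  # [29, 2]
print(by_label.tolist())     # [29, 2, 30]
```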

In [None]:
# Using loc to access the first three rows from the age column (loc slices include the end label)
titanic_data.loc[0:2,'age']

In [None]:
# EXERCISE 4
# Describe the difference between iloc and loc, using your own words

#Answer here

Now let's give it a try on our own. Perform the following operations using iloc:

In [None]:
# EXERCISE 5
# Access the 100th row from the cabin column
cabin_value = 
print("The cabin value of the 100th row is {}".format(cabin_value))


# Access rows 300-310 from the ticket column 
ticket_values = 
print("Ticket values from indices 300-310 are: {}".format(ticket_values))


# Access the second row, and the last 5 columns from the Titanic Data set 
five_columns_value = 
print("The last five columns contain the following values in the second row: {}".format(five_columns_value))

### Manipulating the Index
If you have not noticed yet, every time we print a pandas DataFrame to the screen, there is a column at the beginning that starts at 0. This column is the index, and it is what Pandas uses as a reference point for every DataFrame. Let's take a look at how to access the index.

In [None]:
titanic_data.index

The information above shows that the index starts at 0, increments by 1 and stops before 1309: the `stop` value is exclusive, so the last index label is 1308.

The index will not always be in order or use consecutive numbers. You can use label values as the index, which you can later use to help make graphs or to access specific data with `loc`. Below we change the index to the sex column to show how you can change the index of a Pandas DataFrame.
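As a small, self-contained sketch of these ideas, the hypothetical `passengers` table below starts with the default numeric index, switches to a label index, and then switches back. The names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical passengers
passengers = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "sex": ["female", "male", "female"],
})

# Default index: a RangeIndex whose stop value is exclusive
print(passengers.index)  # RangeIndex(start=0, stop=3, step=1)

# set_index returns a new DataFrame; the original is left unchanged
by_sex = passengers.set_index("sex")

# With labels as the index, loc selects every row carrying that label
female_names = by_sex.loc["female", "name"].tolist()
print(female_names)  # ['Ann', 'Cara']

# reset_index moves the labels back into an ordinary column
restored = by_sex.reset_index()
```

Because `set_index` returns a new object, the result has to be assigned to a variable for the change to be kept.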

In [None]:
#changing the index
titanic_data_new_index = titanic_data.set_index("sex")
titanic_data_new_index

### Assigning data
Assigning new data to an existing DataFrame is simple. Let's add our own column to the titanic dataset:

In [None]:
# Add a new column named Ship_Name and populate it with the value Titanic
titanic_data['Ship_Name'] = "Titanic"
titanic_data

As you can see we created the column "Ship_Name" and assigned it a string value of "Titanic".

We can now combine some of what we have learned to create two new columns from the data we have. Using the name column, let's separate the values into first and last name. The code below may seem new or a bit complicated, but don't worry, we will discuss what we are doing in class.  
Run the cell below to create the new columns.

In [None]:
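# Create the last name column
# Note: this assignment does not copy the data; both names refer to the same DataFrame,
# so the new column also appears in titanic_data
titanic_data_last_name = titanic_data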
titanic_data_last_name['last_name'] = titanic_data['name'].str.split().str[0].replace(',','',regex=True)
titanic_data_last_name

In [None]:
# Note that the split() method turns a string into a list, then you choose which value from the list you want to use. 

In [None]:
# Create the first name column
titanic_data_name = titanic_data_last_name
titanic_data_name['first_name'] = titanic_data['name'].str.split().str[2].replace(',','',regex=True)
titanic_data_name

As you can see, we now have two new columns: first_name and last_name. Let's try creating some columns of your own.
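Before moving on, the string operations used in the two cells above can be broken into steps. The sketch below uses two hypothetical names written in the "Last, Title. First" layout that the code above implies; it uses `.str.replace` with `regex=False` to remove the comma, which is a slightly different call from the `replace(..., regex=True)` used in the notebook cells.

```python
import pandas as pd

# Hypothetical names, assuming a "Last, Title. First" layout
names = pd.Series(["Allen, Miss. Elisabeth", "Allison, Mr. Hudson"])

# Step 1: split each string on whitespace into a list of words
words = names.str.split()
print(words[0])  # ['Allen,', 'Miss.', 'Elisabeth']

# Step 2: pick one element from every list by position
first_word = words.str[0]   # still has the trailing comma
third_word = words.str[2]

# Step 3: remove the comma with a literal (non-regex) string replacement
last_names = first_word.str.replace(",", "", regex=False)
print(last_names.tolist())  # ['Allen', 'Allison']
print(third_word.tolist())  # ['Elisabeth', 'Hudson']
```

Picking words by position only works when every name follows the same layout; a last name made of two words would shift every later position by one.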

In [None]:
# EXERCISE 6 
# Create a new column called Deck_Location, and assign the value "Upper" to the first half of the dataset,
# and "Lower" to the second half of the dataset. Print the dataset once you are done.


### Conditional Selection
So far we have been using `iloc` and `loc` to select rows and columns from a dataset by position or label. But what if you wanted to find values that match a certain condition?  
This is where we can use conditional selection. In the following cells we will find values that meet conditions from our titanic dataset.
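The mechanics can first be sketched on a small hypothetical table (`df` and its four rows are made up): a comparison produces one True/False value per row, and that boolean Series can then be used to keep only the matching rows.

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "survived": [1, 0, 0, 1],
})

# A comparison produces a boolean Series (a "mask"), one value per row
is_female = df["sex"] == "female"
print(is_female.tolist())  # [True, False, True, False]

# Passing the mask to loc keeps only the rows where it is True
females = df.loc[is_female]

# Combine masks with & (and), | (or), ~ (not); wrap each comparison in parentheses
female_survivors = df.loc[(df["sex"] == "female") & (df["survived"] == 1)]
not_female = df.loc[~is_female]

print(len(females), len(female_survivors), len(not_female))  # 2 1 2
```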

In [None]:
# Using conditional selection, let's find all the female passengers 
titanic_data.sex == 'female'

# Here, we're using the dot operator to access the 'sex' column and comparing it to 'female'. 
# This returns True if the passenger is female and False otherwise

The output above is a Series of True/False values, one per index value, showing where our condition holds. But what if we wanted to see all the data for female passengers instead of the conditional values? We can use `loc` to do this.

In [None]:
# Show the data for female passengers only
titanic_data.loc[titanic_data.sex == 'female']

We now have the data for all the female passengers using conditional selection.  
Let's try a few more:

In [None]:
# Access the data for the female passengers that survived
titanic_data.loc[(titanic_data.sex == 'female') & (titanic_data.survived == 1)]

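# Note that the notation above uses '&' instead of 'and'
# 'and' (logical operator) tries to reduce each whole column comparison to a single True/False,
# which pandas rejects with a ValueError because the truth value of a Series is ambiguous
# '&' (element-wise operator) combines the two conditions row by row
# Each condition needs its own parentheses because '&' binds more tightly than '=='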

In [None]:
# Access the data where the passenger last name is either Allen or Allison 
titanic_data.loc[titanic_data.last_name.isin(['Allen','Allison'])]

# the 'isin()' method allows you to filter the dataset by selecting rows with a particular value(s)

Let's try conditional selection on our own. 

In [None]:
# EXERCISE 7

# Access the values where the passenger survived and their first name was Elisabeth

In [None]:
# Access the values where the fare value equals "2665" and the passenger survived 

In [None]:
# Access the values where the fare value equals "113781" and the passenger survived

## Summary and Mapping

### Summary Functions 
Summary functions are used to quickly summarize your data. They are very handy tools when you are working with a new dataset and we will explore in a later lesson why that is important. 

Let's start by using the `describe()` method.

In [None]:
# Use the describe method to quickly summarize the titanic survival dataset
titanic_data.describe()

The result of the `describe()` method is eight values for each numeric column:
- **count:** number of non-NA/null observations.
- **mean:** mean of the values.
- **std:** standard deviation of the observations.
- **min:** minimum of the values in the object.
- **max:** maximum of the values in the object.
- **25%, 50%, 75%:** these values represent quartiles. We will go over what these mean in another lesson.
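As a quick sketch of how these statistics behave, the hypothetical table below has one numeric column with a missing value and one text column. By default `describe()` summarizes only the numeric column and skips the missing value; passing `include="all"` adds summary rows for non-numeric columns as well.

```python
import pandas as pd

# Hypothetical mixed-type table
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, None],
    "sex": ["female", "male", "female", "male"],
})

# By default only numeric columns are summarized; missing values are skipped
numeric_summary = df.describe()
print(numeric_summary.loc["count", "age"])  # 3.0
print(numeric_summary.loc["mean", "age"])   # 30.0

# include="all" also reports count/unique/top/freq for non-numeric columns
full_summary = df.describe(include="all")
print(full_summary.loc["unique", "sex"])    # 2
```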


If you want to only calculate one value, you can use the following:

In [None]:
# individual columns stats 
titanic_data.survived.mean()

In [None]:
# EXERCISE 8

# Calculate the count of the survived column


#### The `unique()` and `value_counts()` methods 

The following methods allow you to quickly summarize a column:

- **unique:** This method will display all the unique values of your column. 
- **value_counts:** This method will display all the unique values and their counts. 
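A minimal sketch on a hypothetical Series (`ports` and its values are made up) shows both methods, along with two optional `value_counts()` parameters: `normalize=True` returns proportions instead of counts, and `dropna=False` keeps missing values as their own group.

```python
import pandas as pd

# Hypothetical column with a missing value
ports = pd.Series(["S", "C", "S", None, "S", "Q"])

# unique() returns each distinct value once, including the missing one
print(ports.unique())

# Counts per value, most frequent first; missing values are dropped by default
counts = ports.value_counts()
print(counts["S"])  # 3

# normalize=True returns proportions; dropna=False keeps the missing-value group
shares = ports.value_counts(normalize=True)
with_missing = ports.value_counts(dropna=False)
print(shares["S"])        # 0.6
print(len(with_missing))  # 4
```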

In [None]:
# Unique values in the sex column
titanic_data.sex.unique()

In [None]:
#value counts 
titanic_data.sex.value_counts()

In [None]:
# EXERCISE 9

# Use the value_counts() method on the age column and print the results.


### Mapping Functions
`map()` is a method that is very useful in data science and AI for altering data or creating new representations of it. It takes a function and applies that function to every value in a Series. In this lesson we use it on a single column (a Series) rather than on a whole DataFrame.

Let's use both `map()` and `lambda` to normalize age values in the titanic dataset. 

To normalize the data, we will be "scaling" the values to be between 0 and 1 using the following formula:

$$z_{nor} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
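As a quick check of the formula on hypothetical numbers, the sketch below scales three made-up ages plus one missing value. It computes the result twice, once element by element with `map()` and a `lambda`, and once with arithmetic on the whole Series, to show that the two approaches agree.

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([10.0, 20.0, 30.0, None])

min_age = ages.min()  # 10.0 (missing values are skipped)
max_age = ages.max()  # 30.0

# Element-by-element version using map and a lambda
scaled_map = ages.map(lambda x: (x - min_age) / (max_age - min_age))

# Equivalent vectorized arithmetic on the whole Series
scaled_vec = (ages - min_age) / (max_age - min_age)

print(scaled_map.tolist())  # [0.0, 0.5, 1.0, nan]
```

The smallest value maps to 0, the largest to 1, and a missing value stays missing.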

In [None]:
# First, we will use the code below to convert the age column to numeric values and get rid of the '?' in the data
titanic_data['age'] = pd.to_numeric(titanic_data['age'],errors='coerce')

# Find the max and min from the age column
max_age = titanic_data.age.max()
min_age = titanic_data.age.min()

# Use the Mapping Function
titanic_data.age.map(lambda p: (p - min_age)/(max_age - min_age))

Another useful method that we can use to alter data is the `apply()` method. This method is similar to the `map()` method, but it can be used with a pandas DataFrame.
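A minimal sketch of how `apply()` behaves on a DataFrame, using a hypothetical two-column table: the `axis` argument decides whether the function receives one row at a time or one column at a time. The "family size" calculation here is only an illustration and is not part of the lesson's dataset work.

```python
import pandas as pd

# Hypothetical table
df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 0, 1]})

# axis="columns" calls the function once per row; each row arrives as a Series
family_size = df.apply(lambda row: row["sibsp"] + row["parch"] + 1, axis="columns")
print(family_size.tolist())  # [2, 1, 4]

# axis="index" (the default) calls the function once per column instead
column_max = df.apply(lambda col: col.max(), axis="index")
print(column_max.tolist())  # [2, 1]
```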

Let's create a function that will convert the embarked column from the port of embarkation to the country of embarkation.

In [None]:
def rename_embarked(row):
    if row.embarked == 'S':
        row.embarked = 'UK'
    elif row.embarked == 'C':
        row.embarked = 'FR'
    elif row.embarked == 'Q':
        row.embarked = 'IE'
    return row
    
titanic_data = titanic_data.apply(rename_embarked,axis='columns')
titanic_data

In [None]:
# EXERCISE 10

# Replace the `?` value in the fare column with NaN and print the new fare column.
# Create your own mapping function for the fare column and print the new DataFrame.