# Welcome to the Data Manipulation Lesson.
## Notebook 1

The workbook is broken up into sections.  Each section has an article for you to read, a few questions to check your understanding, and code for you to work with related to the reading topics.


In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv("titanic.csv")

## Before You Get Started

We are going to be using the Titanic dataset. Make sure to run head() on it before you start working with the manipulation methods.

In [None]:
# Run the head of your data set here:


In [None]:
# check for duplicates
data.duplicated().sum()

In [None]:
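# if there are, go ahead and drop them.
# Note: drop_duplicates() returns a new DataFrame and leaves `data` unchanged;
# assign the result back (data = data.drop_duplicates()) if you want the change to persist.
data.drop_duplicates()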
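
If you want to see how the duplicate check behaves before trying it on the Titanic data, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy table with one exact duplicate row (rows 0 and 2 match)
toy = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [30, 25, 30]})

n_dupes = toy.duplicated().sum()   # counts rows that repeat an earlier row
deduped = toy.drop_duplicates()    # a new DataFrame; toy itself is unchanged

print(n_dupes, len(toy), len(deduped))
```

Notice that `toy` keeps all of its rows. Only the returned table has the duplicate removed.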

### Cleaning Note:

While the columns are not the "prettiest", don't adjust any of them just yet. We are going to update some values and add some values as we work through this notebook. Apologies for the extra visual "noise" on your screen. You will be given the option to tidy up the columns at the end of this notebook.

## Running Tables Note:  
If your tables don't appear to have accepted your changes, try the "Run All" option in the "Cell" section of the menu bar.  

<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# A. Aggregation

### 1: Groupby <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

#### Groupby "embark_town"

1. Using the titanic data set, groupby "embark_town".
1. Create a variable that will represent the grouping of data.
1. Initialize it using the groupby() function and pass it the column.

In [None]:
#SOLUTION
data_group = data.groupby("embark_town") 
data_group.first() # to check the table
# Need to use first to see the organization of the table
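
The grouping step above can be sketched on a small made-up table. This is only an illustration: the `toy` DataFrame, its "town" column, and its values are assumptions and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical mini table with two towns
toy = pd.DataFrame({
    "town": ["Cherbourg", "Southampton", "Cherbourg", "Southampton"],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

grouped = toy.groupby("town")    # a GroupBy object, not a table yet
group_sizes = grouped.size()     # number of rows in each group
first_rows = grouped.first()     # first non-null value per column in each group

print(group_sizes)
print(first_rows)
```

A GroupBy object does not display as a table on its own, which is why a method such as first() or size() is chained on to inspect it.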

#### Groupby "survived"

Did you know that you can also chain on some of our exploratory methods to the groupby method?

1. Create & initialize a new variable to hold a table that will groupby "survived" 
1. Use method chaining to tack on the describe method

In [None]:
# SOLUTION
survive_group = data.groupby("survived").describe()
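
To get a feel for what describe() returns after a groupby, here is a small sketch on made-up values. The `toy` table is an assumption, and the example selects a single column so the output stays narrow.

```python
import pandas as pd

# Hypothetical table: two groups of two passengers each
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [20.0, 30.0, 40.0, 60.0],
})

# One row per group; one column per summary statistic
summary = toy.groupby("survived")["age"].describe()
print(summary)
```

Each row of the result is one group, and the columns are the statistics (count, mean, std, min, quartiles, max).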

### 2. Aggregation Methods <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

#### Method Chaining

1. Create a variable to apply **agg()** to your grouped data.
1. Pass one of the following statistical values to **agg()**
   - "mean", "median", "mode", "min", "max", "std", "var", "first", "last", "sum"

In [None]:
#SOLUTION -- will vary based on which they choose
mean_data = data_group.agg('mean', numeric_only=True)
mean_data.head()
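
As a rough sketch of what the string passed to agg() does, the made-up example below compares agg("mean") with calling mean() directly on the grouped data. The `toy` table and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical all-numeric table grouped by a "town" label
toy = pd.DataFrame({
    "town": ["A", "B", "A", "B"],
    "fare": [10.0, 20.0, 30.0, 40.0],
    "age": [20.0, 30.0, 40.0, 50.0],
})
grouped = toy.groupby("town")

mean_by_town = grouped.agg("mean")   # the string names the statistic to compute
max_by_town = grouped.agg("max")

print(mean_by_town)
print(max_by_town)
```

Here agg("mean") gives the same table as grouped.mean(), so the string is simply a way of naming the statistic you want.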

3. Create a variable to method chain head() with agg("sum")

In [None]:
# SOLUTION
sum_data = data_group.head().agg("sum", numeric_only=True)
sum_data.head()

In [None]:
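# Explain the sum table.  What is going on with the "sex", "class", and "alive" columns?

   # String values cannot be summed numerically; summing them concatenates the strings instead.
   # Note: numeric_only=True in the cell above excludes these non-numeric columns from the result.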

#### Aggregation across multiple columns using dictionary functionality

##### Syntax Example:

**age_dictionary={"age":["sum", "max"]}**

1. What if we want to look at more than one column at a time?  We add more key-value pairs to the dictionary we pass to the agg function.
1. Create a variable to hold a dictionary with at least 3 columns.  Use the syntax from the "Syntax Example" as a guide.
    - Aggregate the following:  survived: "sum" & "count"; age: "std" & "min"; and sibsp: "count" & "sum"

In [None]:
#SOLUTION
test_dictionary={"survived": ["sum", "count"], 
                 "age": ["std", "min"], 
                 "sibsp": ["count","sum"]
                }

dictionary_agg=data.agg(test_dictionary)
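
The shape of a dictionary aggregation can be surprising the first time you see it, so here is a minimal sketch on made-up values. The `toy` table is an assumption; the column names are borrowed from the exercise only for familiarity.

```python
import pandas as pd

# Hypothetical three-passenger table
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [20.0, 30.0, 40.0],
})

# Each key is a column; each value lists the functions to run on that column
result = toy.agg({"survived": ["sum", "count"], "age": ["min", "max"]})
print(result)
```

The result has one row per function name and one column per key. Where a function was not requested for a column, the cell is NaN.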

### 3. Groupby and Basic Math <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1. Groupby "pclass".  Make sure you use a variable to hold your grouped data.

In [None]:
#SOLUTION
num_only_data = data.select_dtypes(include='number')

passenger_class = num_only_data.groupby(['pclass'])  
passenger_class.first()

### 4. Groupby and Multiple Aggregations <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

#### Group with a List<span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

1. We want to apply multiple aggregation functions to our newly grouped data set.  We create a variable to hold a list of the functions we want to perform.  These functions are all names that the agg method understands.  When we pass our list to the method, it applies each function in the list to every column of the grouped table.

In [None]:
# SOLUTION
agg_func_list = ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']
passenger_class.agg(agg_func_list) 

#### Group with a Dictionary<span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

Using only a list provides us with the entire table.  What if we only want to look at age vs pclass?  

We can create a dictionary to hold the age column for us.  The *key* would be the name of our column, and the *value* our list of functions to perform on that column.  The code would look like this:

In [None]:
agg_func_dict = {
    'age':
    ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']
}
# We would run our table like this:
passenger_class.agg(agg_func_dict)  

1. Looking at the *agg_func_dict* syntax, create a dictionary variable for the "survived" column and pass it to **passenger_class.agg()** in the box below.

In [None]:
#SOLUTION ---
agg_func_dict = {
    'survived':
    ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']
}
passenger_class.agg(agg_func_dict) 
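
The list and dictionary forms above can be compared side by side on a small made-up table. This is a sketch under assumptions: the `toy` data and the shortened function list are for illustration only.

```python
import pandas as pd

# Hypothetical table with two passenger classes
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "age": [40.0, 50.0, 20.0, 30.0],
    "survived": [1, 0, 0, 1],
})
by_class = toy.groupby("pclass")

# A list applies every function to every remaining column
all_cols = by_class.agg(["min", "max"])

# A dictionary limits the result to the columns named as keys
age_only = by_class.agg({"age": ["min", "max"]})

print(all_cols)
print(age_only)
```

In both cases the column labels have two levels: the original column name on top and the function name underneath, so a single cell is reached with a pair such as ("age", "max").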

<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# B. Recoding and Creating New Values and Variables 

In [None]:
#SOLUTION
# Create a new column by multiplying every fare by 117.17
data["fare_2021"] = data["fare"] * 117.17
data.head()

### Replacing Values <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 
Replace the values in the "alive" column from the strings "yes" or "no" to bools, where "yes" becomes True and "no" becomes False.

In [None]:
# SOLUTION
data["alive"] = data["alive"].replace({"yes":True, "no":False})
data.head()

We can also use functions to update values.  

1. Create a function that will set the alive values as bools.  Apply it to your table and run your table here:

In [None]:
# SOLUTION
# Note: This solution assumes the alive column still contains string values "yes"/"no"
# If you've already converted to booleans above, skip this cell or reset your data first

def set_alive_to_bool(value):
    # apply() passes in one cell value at a time, not the whole column
    if value == "yes":
        return True
    else:
        return False

# Only run this if alive column still has string values
# data["alive"] = data["alive"].apply(set_alive_to_bool)
# data.head()
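
If you would like to try the function approach without touching your Titanic table, here is a minimal sketch on a made-up column. The `alive` Series and the helper name `yes_no_to_bool` are hypothetical, and `Series.map` with a dictionary is shown only as an alternative route to the same values.

```python
import pandas as pd

# Hypothetical column of "yes"/"no" strings
alive = pd.Series(["yes", "no", "yes"])

def yes_no_to_bool(value):
    # Receives one cell value at a time from apply()
    return value == "yes"

via_apply = alive.apply(yes_no_to_bool)          # function-based recoding
via_map = alive.map({"yes": True, "no": False})  # dictionary-based recoding

print(via_apply.tolist())
print(via_map.tolist())
```

Both routes leave the original `alive` Series untouched, so you still need to assign the result back to a column to keep it.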

### Using a function to create a new column <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

Sometimes you might want to create a new column derived from the values in one or more existing columns.

1. Create an "age_group" column that breaks years up as 0-19, 20-29, 30-39, etc. until all given ages are covered.  Make sure you check to see where you can stop counting by 10s.

In [None]:
#SOLUTION

age_check = data["age"].agg("max")
age_check

def age_groups(series):
    if series <20:
        return "0-19 years"
    elif 20 <= series < 30:
        return "20-29 years"
    elif 30 <= series < 40:
        return "30-39 years"
    elif 40 <= series < 50:
        return "40-49 years"
    elif 50 <= series < 60:
        return "50-59 years"
    elif 60 <= series < 70:
        return "60-69 years"
    elif 70 <= series < 80:
        return "70-79 years"
    elif 80 <= series < 90:
        return "80-89 years"
    else:
        return "no data"
    
data["age_group"] = data["age"].apply(age_groups) 
data.head()
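
As an aside, pandas also has a binning helper, `pd.cut`, that can build similar groups without a chain of elif statements. The sketch below is an alternative technique, not the solution above: the `ages` values are made up, and ages outside the bins (including negative or missing ones) fall into "no data".

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([5, 19.9, 20, 35, 80, np.nan])

edges = [0, 20, 30, 40, 50, 60, 70, 80, 90]
labels = ["0-19 years"] + [f"{lo}-{lo + 9} years" for lo in range(20, 90, 10)]

# right=False makes each bin include its lower edge and exclude its upper edge
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

# Missing ages come back as NaN; give them an explicit label
groups = groups.cat.add_categories("no data").fillna("no data")

print(groups.tolist())
```

The function-based solution is easier to read line by line, while `pd.cut` keeps the bin edges and labels in two short lists that are easy to change.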

<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# C. Reshaping Tables

### Sort_values <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

Use **sort_values()** to answer the following question:
> What is the age of the person who paid the highest fare?

Hint: We want to see the highest fare value first. What order would we want? ascending or descending?  Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values) for the syntax.

In [None]:
#SOLUTION
sort_data = data.sort_values(by="fare", ascending=False)
sort_data.head()
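
Here is a minimal sketch of reading an answer off a sorted table, using made-up numbers rather than the Titanic data. The `toy` table is an assumption, and `idxmax` is included only as a cross-check.

```python
import pandas as pd

# Hypothetical fares and ages
toy = pd.DataFrame({
    "fare": [7.25, 512.33, 71.28],
    "age": [22.0, 35.0, 38.0],
})

by_fare = toy.sort_values(by="fare", ascending=False)  # highest fare first
top_age = by_fare.iloc[0]["age"]                       # age in the first row

# Cross-check: look up the row label of the largest fare directly
check_age = toy.loc[toy["fare"].idxmax(), "age"]

print(top_age, check_age)
```

If several rows tie for the highest fare, the first row alone will not tell the whole story, so it is worth looking at a few rows with head().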

### pivot_table <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 
1. Pivot the data into a table of summed values where the values are "fare", the index is "who" and "age_group", and the columns are "survived".

Hint: set the aggfunc parameter to "sum"




In [None]:
#SOLUTION
data_pt = data.pivot_table("fare", ["who", "age_group"], "survived", aggfunc = 'sum') 
data_pt.head()
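
To see how pivot_table arranges its output, here is a small sketch on made-up rows with a single index column. The `toy` table and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "who": ["man", "man", "woman", "woman", "man"],
    "survived": [0, 1, 1, 1, 0],
    "fare": [10.0, 20.0, 30.0, 40.0, 5.0],
})

# Rows come from "who", columns from "survived", cells hold summed fares
pt = toy.pivot_table(values="fare", index="who", columns="survived", aggfunc="sum")
print(pt)
```

Rows that share the same "who" and "survived" values are combined into one cell by the aggregation function. A combination that never occurs in the data shows up as NaN.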

### Wide to Long <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1. Create a table where the columns are "who" and the values are "pclass"
1. Answer the question:  How does this table differ from the pivot_table above?  Specifically, how is "who" different?

In [None]:
# SOLUTION
wide_long = data.pivot(columns="who", values="pclass")
wide_long.head()

# Question - how does this table differ from data_pt in the box above?
# Specifically, how is "who" different?
# SOLUTION - using pivot, "who" created the columns; using pivot_table (above), "who" created the index
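
The difference can be seen on a three-row made-up table. This is only a sketch: the `toy` data is an assumption, and the index is left out of the call so the original row labels are kept.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "who": ["man", "woman", "man"],
    "pclass": [3, 1, 2],
})

# No aggregation: every original row keeps its own row in the result
wide = toy.pivot(columns="who", values="pclass")
print(wide)
```

Each value of "who" becomes its own column, and a row only has a value in the column that matches it. The other cells are NaN because pivot reshapes the data without combining any rows.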

### Melt <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1.  What does **melt** do to the data?  Explain each column.

In [None]:
# What does melt do?
   # converts a wide table to a long table 

2. With a new variable, apply a default melt to your data.

In [None]:
# SOLUTION
# default melt
data_m=pd.melt(data)
data_m.shape

# (15147, 2)

data_m.head()


3. Create a melt table where the identifier variable (id_vars) is "embarked", and the value variables (value_vars) are "fare" and "deck"

In [None]:
# SOLUTION
data_melt = pd.melt(data, id_vars=["embarked"], value_vars=["fare", "deck"])
data_melt.shape 

data_melt.head()
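
Both melt calls above can be tried on a tiny made-up table to see where the rows and columns go. The `toy` table and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-passenger table
toy = pd.DataFrame({
    "embarked": ["S", "C"],
    "fare": [7.25, 71.28],
    "deck": ["A", "C"],
})

# Default melt: every column is stacked into variable/value pairs
default_melt = pd.melt(toy)

# Keyed melt: "embarked" stays as a column, the other two are stacked
keyed_melt = pd.melt(toy, id_vars=["embarked"], value_vars=["fare", "deck"])

print(default_melt.shape, keyed_melt.shape)
print(keyed_melt)
```

The "variable" column holds the old column names and the "value" column holds the cell contents, so the row count is the original row count multiplied by the number of stacked columns.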


# Optional Challenges:

1. Clean and Explore the table.  
    1. How would you handle any missing data?
    1. Would you keep all of the columns?
    1. Would you want to manipulate any data?