In [0]:
import numpy as np
import pandas as pd

# String and Date Manipulation -- CHALLENGE!

This is a "challenge" section because the answers will not be provided! We will learn how to find the appropriate functions that we need by searching the internet and using question/answer sites like StackOverflow. (Don't worry, we'll do it together.)

Manipulating text and dates is a common part of data analysis. Sometimes we will have a single column containing several variables, such as plate number, well name, and experimental condition. We often need to split those values into their own columns so that we can `groupby` them separately.

Similarly with dates, perhaps we have a single text column recording when the experiment was done, but we need the day and the hour in separate columns to compare data between days.

Pandas provides functions for all of these use cases, so let's go find them with our internet search prowess!

In [0]:
titanic = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv')

titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# String Manipulation

## Text Case

Starting with our titanic dataset, we want to manipulate the text in the "Name" column by making it all lowercase. We don't know what function to use, so we will search for it:

- Go to google.com and search for
  - stackoverflow.com: Pandas text to lowercase
- Click on one of the StackOverflow search results
- Read the answer and the question
- Use the code cell below to convert the "Name" column to lowercase

In [0]:
# ----------------------------------------------
# YOUR CODE HERE
# Print the titanic "Name" column IN LOWERCASE
# ----------------------------------------------

titanic['Name']  # You fill in the rest

When your code prints the Name column in lowercase, then you did it. Good work!

Now find a way to print the Name column in uppercase using our internet search method.

In [0]:
# ----------------------------------------------
# YOUR CODE HERE
# Print the titanic "Name" column in UPPERCASE
# ----------------------------------------------

titanic['Name'].  # You fill in the rest

## Get Last Name

Now we would like to create a new column in the titanic dataset that contains each person's last name. We know in Python that type of string operation is called `split`, so let's see if pandas has a similar function we could use.
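As a reminder of how `split` behaves in plain Python (not pandas), here is a minimal sketch on a single string. The name is taken from the first row of the table above; the variable names are only illustrative. Finding the pandas equivalent that works on a whole column is still the exercise.

```python
# Plain Python: split one string on the comma that follows the last name
full_name = "Braund, Mr. Owen Harris"

parts = full_name.split(",")   # a list with the text before and after the comma
print(parts)

last = parts[0].strip()        # first piece, with surrounding whitespace removed
print(last)
```

Note that the second piece keeps its leading space, which is why `strip()` is often useful after splitting.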

- Go to google.com and search for
  - stackoverflow.com: Pandas split string to new column
- Click on one of the StackOverflow search results
- Read the answer and the question
- Create a new "last_name" column in the titanic DataFrame

In [0]:
# ---------------------------------------------------------------
# YOUR CODE HERE
# Create a last_name column containing each passenger's last name
# 
# HINT: Look at the "Name" column and notice that there is always
# a comma "," after the last name.
# ---------------------------------------------------------------

titanic['last_name'] =   # You fill in the rest

Getting the last name is a bit more complex, so if you have trouble, create more code cells above and experiment by printing intermediate results.

Once you have a new column containing each person's last name, you did it. Good job!

# Date Manipulation

Let's import a new dataset containing dates.

In [0]:
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv')
air_quality = air_quality.rename(columns={"date.utc": "date_and_time"})

air_quality.head()

Unnamed: 0,city,country,date_and_time,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³


Our `air_quality` dataframe contains a column named `date_and_time`. Let's inspect that column to see what datatype it is.

In [0]:
air_quality['date_and_time']

0       2019-06-21 00:00:00+00:00
1       2019-06-20 23:00:00+00:00
2       2019-06-20 22:00:00+00:00
3       2019-06-20 21:00:00+00:00
4       2019-06-20 20:00:00+00:00
                  ...            
2063    2019-05-07 06:00:00+00:00
2064    2019-05-07 04:00:00+00:00
2065    2019-05-07 03:00:00+00:00
2066    2019-05-07 02:00:00+00:00
2067    2019-05-07 01:00:00+00:00
Name: date_and_time, Length: 2068, dtype: object

At the bottom we see that the data type (dtype) is `object`, which here means the values are stored as plain text strings rather than dates. To perform date manipulations, we need the column to have a datetime dtype.
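To see what an `object` dtype holds, here is a small sketch on a hand-made Series. The Series `toy` is hypothetical and is built with `dtype=object` explicitly so that it mirrors the output above; the two values are copied from the rows printed earlier.

```python
import pandas as pd

# Hypothetical two-row Series of date-like text, stored as generic Python objects
toy = pd.Series(
    ["2019-06-21 00:00:00+00:00", "2019-06-20 23:00:00+00:00"],
    dtype=object,
)

print(toy.dtype)             # the dtype of the whole Series
print(type(toy.iloc[0]))     # the Python type of a single value

# pandas can also report whether a Series already holds datetimes
print(pd.api.types.is_datetime64_any_dtype(toy))
```

The same checks can be run on your own column before and after the conversion to confirm that the dtype changed.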

- Go to google.com and search for
  - stackoverflow.com: Pandas column to datetime
- Click on one of the StackOverflow search results
- Read the answer and the question
- Convert the "date_and_time" column to `datetime`

In [0]:
# ---------------------------------------------------------------
# YOUR CODE HERE
# Convert the date_and_time column to datetime
# 
# HINT: This column will work without any "format" parameters
# ---------------------------------------------------------------

air_quality['date_and_time'] =   # Your code here

Once you have converted the column to a datetime type, the following code cell will compare the earliest and latest dates, giving us the duration of our experiment.

Confirm that the code works without error. (If you get an error, go back and make sure your conversion to datetime was successful.)

In [0]:
# Run this cell to confirm your datetime conversion worked

air_quality["date_and_time"].max() - air_quality["date_and_time"].min()
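
The check above relies on the fact that subtracting one timestamp from another yields a duration, whereas subtracting two text strings raises a `TypeError`. A minimal sketch with two hand-written `pd.Timestamp` values follows; they are copied from the first and last rows printed earlier and are not necessarily the true minimum and maximum of the column.

```python
import pandas as pd

# Two timestamps copied from the printed rows (illustrative only)
start = pd.Timestamp("2019-05-07 01:00:00+00:00")
end = pd.Timestamp("2019-06-21 00:00:00+00:00")

duration = end - start    # subtracting timestamps gives a pd.Timedelta
print(duration)
print(duration.days)      # whole days contained in the duration
```

A `Timedelta` result from the cell above is therefore a good sign that the column now holds real datetimes.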

Our air quality experiment was run over several weeks, and each week was a replicated experiment. So rather than the date, we need a new "week" column so that we can create our figure.

- Go to google.com and search for
  - stackoverflow.com: Pandas get weekofyear from datetime
- Click on one of the StackOverflow search results
- Read the answer and the question
- Create a new "week" column from the `date_and_time` column

In [0]:
# ---------------------------------------------------------------
# YOUR CODE HERE
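# Create a new week column from the datetime column
#
# HINT: Some older answers rely on accessors that were removed in
# pandas 2.0. If you get an AttributeError, look for a newer answer.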
# ---------------------------------------------------------------

air_quality['week'] = 

Now that we have the `week` column, we can answer questions like, "what was the mean NO2 concentration each successive week?"

Confirm that the following code cell runs without error. (If you get an error, fix your code above.)

In [0]:
# Run this cell to confirm your week conversion worked

air_quality.groupby(["week", "location"])["value"].mean()
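
To make the grouping step concrete, here is a small sketch on a hand-made DataFrame. The frame `toy` and its numbers are invented for illustration and are not taken from the air quality data; only the column names match.

```python
import pandas as pd

# Invented example data with the same column names as air_quality
toy = pd.DataFrame({
    "week": [19, 19, 20, 20],
    "location": ["A", "A", "A", "B"],
    "value": [10.0, 20.0, 30.0, 50.0],
})

# One mean per (week, location) pair; the result is a Series
# indexed by both grouping columns
means = toy.groupby(["week", "location"])["value"].mean()
print(means)
```

Each combination of week and location that appears in the data gets its own row in the result, so week 19 at location "A" averages its two readings while the other two groups contain one reading each.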