# Lab 3 - Complex Data and Visualization

Last time we saw a method for selecting and working with data
in a table. We saw how Pandas as a library allowed us to work
with data like a spreadsheet and also go beyond simple
selection and manipulation.

In this class we will work with data about the temperatures and
temperature changes of different cities over time.

[Temperatures](https://docs.google.com/spreadsheets/d/1Jwcr6IBJbOT1G4Vq7VqaZ7S1V9gRmUb5ALkJPaG5fxI/edit?usp=sharing)

Before we dive in, let's do a little review of some of the methods we saw in the last class.

In [1]:
import pandas as pd
import altair as alt

Recall that we first need to load our data. We saw the `read_csv`
function last time. We need to add a few extra options in
order to load this data. In particular, we want the `dt` column
to be parsed as a date. One way you can look this up is through the function
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
Remember, though, that often the quickest thing to do is to find the
[answer](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
on Stack Overflow.

In [2]:
df = pd.read_csv("data/Temperatures.csv",
                 index_col=0,
                 parse_dates=["dt"])
df

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
58171,1843-04-01,,,Acapulco,Mexico,16.87,-100.47
58174,1843-07-01,,,Acapulco,Mexico,16.87,-100.47
58177,1843-10-01,,,Acapulco,Mexico,16.87,-100.47
58180,1844-01-01,,,Acapulco,Mexico,16.87,-100.47
58183,1844-04-01,,,Acapulco,Mexico,16.87,-100.47
...,...,...,...,...,...,...,...
8494752,2012-07-01,23.213,0.759,Zapopan,Mexico,20.09,-104.08
8494755,2012-10-01,22.456,0.648,Zapopan,Mexico,20.09,-104.08
8494758,2013-01-01,18.463,0.663,Zapopan,Mexico,20.09,-104.08
8494761,2013-04-01,22.464,0.346,Zapopan,Mexico,20.09,-104.08
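
To see what `parse_dates` changes, here is a small sketch that reads the same text twice. The three-row CSV is a made-up sample in the same layout as our file, not real data, and `io.StringIO` just stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like Temperatures.csv.
csv_text = """,dt,AverageTemperature,City
0,1950-01-01,-0.5,New York
1,1950-04-01,9.1,New York
2,1950-07-01,,New York
"""

# Without parse_dates, the "dt" column is read as plain text.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates, the same column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["dt"])

print(raw["dt"].dtype)
print(parsed["dt"].dtype)
```

Note that the empty temperature in the last row is loaded as a missing value, just like the empty cells in the table above.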


Let's now review the different tools that we have available to us.
We can see the different columns in the table. 

In [3]:
df.columns

Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')

We can also filter the table to keep only the rows that match a condition.

In [4]:
filter = df["City"] == "New York"
nyc_df = df.loc[filter]
nyc_df

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
5203975,1744-01-01,,,New York,United States,40.99,-74.56
5203978,1744-04-01,9.788,2.151,New York,United States,40.99,-74.56
5203981,1744-07-01,22.207,1.305,New York,United States,40.99,-74.56
5203984,1744-10-01,8.968,1.558,New York,United States,40.99,-74.56
5203987,1745-01-01,-2.363,1.771,New York,United States,40.99,-74.56
...,...,...,...,...,...,...,...
5207197,2012-07-01,24.479,0.403,New York,United States,40.99,-74.56
5207200,2012-10-01,12.436,0.344,New York,United States,40.99,-74.56
5207203,2013-01-01,-0.968,0.290,New York,United States,40.99,-74.56
5207206,2013-04-01,9.723,0.355,New York,United States,40.99,-74.56


We have seen how we can build multiple filters and combine them with
operators like or (`|`) and and (`&`).

In [5]:
filter = (df["City"] == "New York") | (df["City"] == "Philadelphia") 
nyc_phila_df = df.loc[filter]
nyc_phila_df

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
5203975,1744-01-01,,,New York,United States,40.99,-74.56
5203978,1744-04-01,9.788,2.151,New York,United States,40.99,-74.56
5203981,1744-07-01,22.207,1.305,New York,United States,40.99,-74.56
5203984,1744-10-01,8.968,1.558,New York,United States,40.99,-74.56
5203987,1745-01-01,-2.363,1.771,New York,United States,40.99,-74.56
...,...,...,...,...,...,...,...
5849433,2012-07-01,26.118,0.315,Philadelphia,United States,39.38,-74.91
5849436,2012-10-01,14.584,0.321,Philadelphia,United States,39.38,-74.91
5849439,2013-01-01,2.252,0.196,Philadelphia,United States,39.38,-74.91
5849442,2013-04-01,11.459,0.319,Philadelphia,United States,39.38,-74.91
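
Here is a small sketch of how `|` and `&` behave, using a made-up four-row table rather than our real data. It also shows `isin`, which can replace a chain of `|` comparisons on a single column.

```python
import pandas as pd

# Hypothetical toy table, just for illustration.
cities = pd.DataFrame({
    "City": ["New York", "Philadelphia", "Boston", "New York"],
    "Year": [1950, 1950, 1950, 1951],
})

# Each comparison needs its own parentheses,
# because | and & bind more tightly than ==.
either = (cities["City"] == "New York") | (cities["City"] == "Philadelphia")
both = (cities["City"] == "New York") & (cities["Year"] == 1950)

# isin checks membership in a list, like several | comparisons at once.
either_isin = cities["City"].isin(["New York", "Philadelphia"])

print(cities.loc[either])
print(cities.loc[both])
```

The "or" filter keeps three rows, while the "and" filter keeps only the single row that satisfies both conditions.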


Once we have a dataframe that is filtered in a specific manner,
we can use it to compute properties of the remaining rows.

In [6]:
average_temp = nyc_df["AverageTemperature"].mean()
average_temp

9.514878846153847
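
Some rows in our table have no temperature recorded (the first New York row, for example). By default `mean` skips those missing values. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical values: one missing reading and two real ones.
temps = pd.Series([float("nan"), 10.0, 20.0], name="AverageTemperature")

# Default behavior: missing values are skipped.
mean_default = temps.mean()

# With skipna=False, a single missing value makes the result missing.
mean_strict = temps.mean(skipna=False)

# count() reports how many values are actually present.
n_present = temps.count()

print(mean_default, mean_strict, n_present)
```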

Finally we can add new columns by setting them in the original dataframe. 

In [7]:
def in_nyc(city):
    "Returns True if the city is New York."
    if city == "New York":
        return True
    return False
df["InNYC"] = df["City"].map(in_nyc)
df

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude,InNYC
58171,1843-04-01,,,Acapulco,Mexico,16.87,-100.47,False
58174,1843-07-01,,,Acapulco,Mexico,16.87,-100.47,False
58177,1843-10-01,,,Acapulco,Mexico,16.87,-100.47,False
58180,1844-01-01,,,Acapulco,Mexico,16.87,-100.47,False
58183,1844-04-01,,,Acapulco,Mexico,16.87,-100.47,False
...,...,...,...,...,...,...,...,...
8494752,2012-07-01,23.213,0.759,Zapopan,Mexico,20.09,-104.08,False
8494755,2012-10-01,22.456,0.648,Zapopan,Mexico,20.09,-104.08,False
8494758,2013-01-01,18.463,0.663,Zapopan,Mexico,20.09,-104.08,False
8494761,2013-04-01,22.464,0.346,Zapopan,Mexico,20.09,-104.08,False
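
For a simple test like this one, comparing the column directly gives the same result as `map` with a function. The sketch below uses a made-up three-row table, and the column name `InNYC_direct` is only for illustration.

```python
import pandas as pd

# Hypothetical toy table.
cities = pd.DataFrame({"City": ["New York", "Boston", "New York"]})

def in_nyc(city):
    "Returns True if the city is New York."
    return city == "New York"

# Apply the function to every value in the column.
cities["InNYC"] = cities["City"].map(in_nyc)

# Compare the whole column at once.
cities["InNYC_direct"] = cities["City"] == "New York"

print(cities)
```

`map` is still the more general tool, since the function can contain any logic you like.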


## Advanced table functions

Our filters have mainly tried to filter rows by string values,
but we can filter by many different properties. These properties
depend on the type of the column.

If you remember back to lesson 1, we saw how we could use dates
in Python.

In [8]:
import datetime
date1 = datetime.datetime.now()
date1

datetime.datetime(2021, 5, 27, 12, 7, 4, 339096)

In [9]:
print(date1.day, date1.month, date1.year)

27 5 2021


Let's look now at the types of our columns. 

In [10]:
df.dtypes

dt                               datetime64[ns]
AverageTemperature                      float64
AverageTemperatureUncertainty           float64
City                                     object
Country                                  object
Latitude                                float64
Longitude                               float64
InNYC                                      bool
dtype: object

We can see that `dt` is a date column (`datetime64[ns]`). Therefore
we can access the same properties, such as month and year, that we saw on Python dates.

In [11]:
df["dt"].dt.month

58171       4
58174       7
58177      10
58180       1
58183       4
           ..
8494752     7
8494755    10
8494758     1
8494761     4
8494764     7
Name: dt, Length: 76224, dtype: int64
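
The `.dt` accessor only works on date columns. A small sketch with made-up dates shows the difference between a parsed column and a column of plain text:

```python
import pandas as pd

# Hypothetical dates, converted to a datetime column.
dates = pd.Series(pd.to_datetime(["1950-01-01", "1950-07-01", "2013-04-01"]))
months = dates.dt.month
years = dates.dt.year
print(months.tolist(), years.tolist())

# The same text without conversion has no date properties.
strings = pd.Series(["1950-01-01", "1950-07-01"])
try:
    strings.dt.month
    accessor_worked = True
except AttributeError:
    accessor_worked = False
print(accessor_worked)
```

This is why we asked `read_csv` to parse the `dt` column when loading the data.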

Let's convert these into columns.

In [12]:
df["Month"] = df["dt"].dt.month
df["Year"] = df["dt"].dt.year

Now these are columns in the table. 

In [13]:
df.columns

Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude', 'InNYC', 'Month', 'Year'],
      dtype='object')

## Advanced Filtering

In [14]:
filter = (df["Month"] == 7) & (df["Year"] == 1950)
summer = nyc_df.loc[filter]
summer

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
5206453,1950-07-01,21.613,0.133,New York,United States,40.99,-74.56


In [15]:
filter = (df["Year"] >= 1950) & (df["Year"] <= 1960) 
fifties = df.loc[filter]
fifties

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude,InNYC,Month,Year
59452,1950-01-01,25.870,0.506,Acapulco,Mexico,16.87,-100.47,False,1,1950
59455,1950-04-01,24.797,0.812,Acapulco,Mexico,16.87,-100.47,False,4,1950
59458,1950-07-01,27.149,0.352,Acapulco,Mexico,16.87,-100.47,False,7,1950
59461,1950-10-01,26.514,0.430,Acapulco,Mexico,16.87,-100.47,False,10,1950
59464,1951-01-01,24.746,0.505,Acapulco,Mexico,16.87,-100.47,False,1,1951
...,...,...,...,...,...,...,...,...,...,...
8494119,1959-10-01,21.412,0.569,Zapopan,Mexico,20.09,-104.08,False,10,1959
8494122,1960-01-01,18.382,0.358,Zapopan,Mexico,20.09,-104.08,False,1,1960
8494125,1960-04-01,20.904,0.493,Zapopan,Mexico,20.09,-104.08,False,4,1960
8494128,1960-07-01,23.138,0.232,Zapopan,Mexico,20.09,-104.08,False,7,1960
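
A range filter like this one can also be written with the `between` method, which includes both endpoints by default. A minimal sketch on made-up years:

```python
import pandas as pd

# Hypothetical years on both sides of the range.
years = pd.DataFrame({"Year": [1949, 1950, 1955, 1960, 1961]})

# Two comparisons combined with &.
manual = (years["Year"] >= 1950) & (years["Year"] <= 1960)

# The same range in one call; both 1950 and 1960 are kept.
shorthand = years["Year"].between(1950, 1960)

print(years.loc[shorthand])
```

Because both ends are included, a 1950 to 1960 filter on the real table covers eleven calendar years, not ten.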


How does the temperature vary over a 10 year period?

In [16]:
filter = df["InNYC"] & (df["Year"] >= 1950) & (df["Year"] <= 1960) 
period = df.loc[filter]

## Visualization

Now let's look at how to graph this data.

In [17]:
simple_df = pd.DataFrame({
    "names" : ["a", "b", "c"],
    "val1" : [10., 20., 30.],
    "val2" : [10., 20., 30.],
})
simple_df

Unnamed: 0,names,val1,val2
0,a,10.0,10.0
1,b,20.0,20.0
2,c,30.0,30.0


Review

In [18]:
chart = (alt.Chart(simple_df)
           .mark_bar()
           .encode(x = "names",
                   y = "val1"))
chart

In [19]:
chart = (alt.Chart(simple_df)
           .mark_line()
           .encode(x = "names",
                   y = "val1"))
chart

* Chart - New York from 1950 to 1960
* Mark - Line graph
* Encode - date and temperature

In [20]:
chart = (alt.Chart(period)
           .mark_line()
           .encode(x = "dt",
                   y = "AverageTemperature"))
chart

We can instead graph different properties.

* Chart - New York from 1950 to 1960
* Mark - Line graph
* Encode - year, temperature, and month (as color)

In [21]:
chart = (alt.Chart(period)
            .mark_line()
            .encode(x = "Year",
                    y = "AverageTemperature",
                    color = "Month:N"))
chart

How does the temperature vary over a 200 year period?

In [22]:
filter = df["InNYC"] & (df["Year"] >= 1800) & (df["Year"] <= 2000)
period = df.loc[filter]
period

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude,InNYC,Month,Year
5204647,1800-01-01,-3.592,2.199,New York,United States,40.99,-74.56,True,1,1800
5204650,1800-04-01,10.463,2.164,New York,United States,40.99,-74.56,True,4,1800
5204653,1800-07-01,23.297,2.179,New York,United States,40.99,-74.56,True,7,1800
5204656,1800-10-01,10.432,3.451,New York,United States,40.99,-74.56,True,10,1800
5204659,1801-01-01,-3.215,2.055,New York,United States,40.99,-74.56,True,1,1801
...,...,...,...,...,...,...,...,...,...,...
5207044,1999-10-01,10.438,0.194,New York,United States,40.99,-74.56,True,10,1999
5207047,2000-01-01,-3.168,0.238,New York,United States,40.99,-74.56,True,1,2000
5207050,2000-04-01,8.897,0.190,New York,United States,40.99,-74.56,True,4,2000
5207053,2000-07-01,20.727,0.407,New York,United States,40.99,-74.56,True,7,2000


How does the temperature vary across different cities over a 10 year period?

In [23]:
filter = df["City"].isin(["New York", "Los Angeles", "Detroit"]) & (df["Year"] >= 1950) & (df["Year"] <= 1960)
period = df.loc[filter]
chart = (alt.Chart(period)
         .mark_line()
         .encode(x = "dt",
                 y = "AverageTemperature",
                 color = "City",
                 strokeDash = "City"))
chart

How does the temperature vary with latitude?

In [24]:
filter = ((df["Country"] == "United States") &
         (df["Year"] == 1950) &
         (df["Month"] == 7))
period = df.loc[filter]
chart = (alt.Chart(period)
         .mark_point()
         .encode(
             y = "AverageTemperature",
             x = "Latitude",
             tooltip=["City", "Country"],
         ))
chart

In [25]:
filter = ((df["Country"] == "United States") &
         (df["Year"] == 1950))
period = df.loc[filter]
chart = (alt.Chart(period)
         .mark_point()
         .encode(
             y = "AverageTemperature",
             x = "Latitude",
             tooltip=["City", "Country"],
             facet="Month"
         ))
chart

## Advanced: GroupBys

A GroupBy computation has three steps:

1) Filter - Figure out the data to start with
2) GroupBy - Determine the subset of data to use
3) Aggregation - Compute a property over the group 
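
The three steps above can be sketched on a tiny made-up table, where the answers are easy to check by hand:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "City": ["New York", "New York", "Boston", "Boston", "Boston"],
    "Year": [1950, 1950, 1950, 1950, 1951],
    "AverageTemperature": [0.0, 20.0, -2.0, 18.0, 5.0],
})

# 1) Filter: keep only the rows from 1950.
in_1950 = toy["Year"] == 1950

# 2) Group By: put the remaining rows into one group per city.
grouped = toy.loc[in_1950].groupby("City")

# 3) Aggregation: compute the mean temperature within each group.
mean_by_city = grouped["AverageTemperature"].mean()
print(mean_by_city)
```

The Boston row from 1951 is removed by the filter, so it does not affect Boston's mean.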

In [26]:
# 1) Filter
filter = ((df["Country"] == "United States") &
          (df["Year"] == 1950))

In [27]:
# 2) Group By
grouped = df.loc[filter].groupby(["Country"])

In [28]:
# 3) Aggregated
temperature = grouped["AverageTemperature"].agg(['mean'])
temperature

Unnamed: 0_level_0,mean
Country,Unnamed: 1_level_1
United States,14.842221


In [29]:
# 2) Group By
grouped = df[filter].groupby(["City"])

In [30]:
# 3) Aggregated
temperature = grouped["AverageTemperature"].agg(['mean'])
temperature

Unnamed: 0_level_0,mean
City,Unnamed: 1_level_1
Albuquerque,12.5025
Austin,21.2855
Baltimore,13.2975
Boston,8.8145
Charlotte,17.554
Chicago,10.52575
Columbus,15.435625
Dallas,18.63575
Denver,9.7135
Detroit,9.2835


In [31]:
# 2) Group By
grouped = df[filter].groupby(["Year", "Country"])

In [32]:
# 3) Aggregated
temperature = grouped["AverageTemperature"].agg(['mean'])
temperature

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
Year,Country,Unnamed: 2_level_1
1950,United States,14.842221


Which city's temperature changes the most during the year?

In [33]:
grouped = df.groupby(["City", "Latitude"])

In [34]:
var = grouped["AverageTemperature"].agg(['mean', 'std'])
var

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std
City,Latitude,Unnamed: 2_level_1,Unnamed: 3_level_1
Acapulco,16.87,26.056345,1.317275
Aguascalientes,21.70,17.983196,2.971204
Albuquerque,34.56,11.127311,8.366040
Apodaca,26.52,21.926115,5.878348
Austin,29.74,19.877495,6.833518
...,...,...,...
Vancouver,49.03,7.189567,5.765395
Veracruz,18.48,23.063930,2.507376
Villa Nueva,15.27,19.117358,1.559936
Winnipeg,50.63,1.269699,13.762060
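
One way to read the answer off a table like this is to sort by the `std` column. The sketch below uses made-up temperatures for two cities, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: four readings per city.
toy = pd.DataFrame({
    "City": ["Acapulco"] * 4 + ["Winnipeg"] * 4,
    "AverageTemperature": [25.0, 26.0, 27.0, 26.0, -18.0, 3.0, 20.0, 5.0],
})

# Mean and standard deviation of temperature per city.
stats = toy.groupby("City")["AverageTemperature"].agg(["mean", "std"])

# Largest standard deviation first.
ranked = stats.sort_values("std", ascending=False)
most_variable = ranked.index[0]
print(ranked)
print(most_variable)
```

Keep in mind that a standard deviation taken over all years mixes the seasonal swing with any long-term change, so it is only a rough measure of variation within a year.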


In [35]:
var = var.reset_index().sort_values("Latitude")
chart = alt.Chart(var).mark_bar().encode(
    y = "mean",
    x = alt.X("City", sort=None),
)
chart2 = chart.mark_point(color='red').encode(
    y = "std",
    x = alt.X("City", sort=None),
    )
chart = chart + chart2
chart