# Data frames (main lesson)

* What is a data frame
  * Just a table, like Sheets or Excel
  * Rows, columns, variables and observations.
  * Example: loading data frame from CSV
  * Aside: what is CSV format
* A tidy dataframe
  * One variable = one column (i.e. no multiple temperature columns)
  * One observation = one row
    * The nature of "one observation" thus dictates the data format
  * Trade-offs of different definitions of "one observation"
* Exercise: create a tidy data frame from the following data:
  * I come to University at 9am, except Friday, when I come at 11am
  * I go home at 4pm, except Wednesday, when I go home at 6pm
  * I eat Ramen on Wednesday, and alternate Soba and Udon on other days, starting Monday with Soba
  * I take rest on weekends and do not come to University
* Explore data frame
  * head/tail
  * summary
  * plotting with Plotly Express
* Exercise: plot something from the dataframe from the previous exercise
* Manipulating data
  * Filtering rows
  * Dropping columns
  * Adding new derived columns
  * Split-apply-combine
* Exercise: clean up some data frame and apply some transformations
* Main exercise
  * Given a data set in CSV, load it into a dataframe
  * Apply some cleaning
  * Plot a few plots
  * Answer some qualitative question about the data based on the plots
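
The four manipulation steps in the outline (filtering rows, dropping columns, adding derived columns, split-apply-combine) can be sketched on a tiny table. This is only an illustrative sketch: the `weather` frame, its column names and its values are invented here and are not part of the lesson's data set.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
weather = pd.DataFrame({
    "city": ["Tokyo", "Tokyo", "Osaka", "Osaka"],
    "temperature_c": [20.0, 22.0, 18.0, 24.0],
    "station_id": ["T1", "T1", "O1", "O1"],
})

# Filtering rows: keep only observations warmer than 19 degrees Celsius.
warm = weather[weather["temperature_c"] > 19.0]

# Dropping columns: remove a column that is not needed for the analysis.
slim = weather.drop(columns=["station_id"])

# Adding a derived column: convert Celsius to Fahrenheit.
slim = slim.assign(temperature_f=slim["temperature_c"] * 9 / 5 + 32)

# Split-apply-combine: split by city, take the mean, combine into one Series.
mean_by_city = slim.groupby("city")["temperature_c"].mean()
```

Each step returns a new object and leaves `weather` unchanged, which makes it easy to inspect intermediate results with `head()`.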

In [1]:
import io

import numpy as np
import pandas as pd
import plotly.express as px

# What is a dataframe

A data frame is a table of data. For example, a standard spreadsheet with data
can be thought of as a data frame. Let's look at an example.

In [7]:
df = pd.read_csv('tokyo-weather.csv')
df.head()

,Time_h,Temperature_C,Precipitation_mm,WindDirection,WindSpeed_ms,SunshineDuration_h,Humidity,Pressure_hPa
0,1,20.7,0,WNW,3.0,,55,1000.8
1,2,20.0,0,WNW,2.9,,58,1001.6
2,3,19.2,0,WNW,2.5,,60,1002.7
3,4,19.7,0,NNW,2.0,0.0,58,1003.8
4,5,17.8,0,WNW,3.0,0.0,69,1005.0


A data frame consists of columns, rows, and cells that hold the values. Cell values can be numeric (with NaN representing a missing number), or they can be strings, which represent text or categorical data.

The interpretation of a data frame comes from statistics. Each column corresponds to a variable, that is, something that can either be measured or be controlled by us. Each row corresponds to one observation, and the values in its columns are logically related. For example, in the table above, one row corresponds to the weather data for one hour.
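
As a small sketch of these ideas, the snippet below builds a three-row frame by hand and counts the missing values per column. The `sample` frame and its values are made up for illustration, loosely modelled on the weather table; they are not read from `tokyo-weather.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, loosely modelled on the weather table above.
sample = pd.DataFrame({
    "Time_h": [1, 2, 3],                          # numeric variable
    "WindDirection": ["WNW", "WNW", "NNW"],       # categorical variable stored as text
    "SunshineDuration_h": [np.nan, np.nan, 0.0],  # numeric variable with missing values
})

# shape is (number of observations, number of variables).
n_rows, n_columns = sample.shape

# Count the missing (NaN) cells in each column.
missing_per_column = sample.isna().sum()

# One row is one observation: all values for the first hour.
first_observation = sample.iloc[0]
```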

In the pandas library, column types can be inspected with the `dtypes` property. Note that numeric types
are further subdivided into integer (`int64`) and floating-point (`float64`) types. String data is represented with dtype `object`.

In [3]:
df.dtypes

Time_h                  int64
Temperature_C         float64
Precipitation_mm        int64
WindDirection          object
WindSpeed_ms          float64
SunshineDuration_h    float64
Humidity                int64
Pressure_hPa          float64
dtype: object

In [13]:
df['Time_h'].values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])

In [18]:
df.iloc[0]

Time_h                     1
Temperature_C           20.7
Precipitation_mm           0
WindDirection            WNW
WindSpeed_ms               3
SunshineDuration_h       NaN
Humidity                  55
Pressure_hPa          1000.8
Name: 0, dtype: object

In [9]:
px.line(df, x='Time_h', y='Temperature_C')

In [11]:
px.histogram(df, x='WindDirection')

## Exercise 1: Define a data frame

In [6]:
input_csv = """index,value
1,2
3,4
"""

dd = pd.read_csv(io.StringIO(input_csv))
dd

,index,value
0,1,2
1,3,4


In [7]:
dd.index

RangeIndex(start=0, stop=2, step=1)
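
The output above suggests that a CSV column named `index` is loaded as an ordinary column, while the row index stays the default `RangeIndex`. If the intent is to use that column as the row index, one option is the `index_col` parameter of `pd.read_csv`. A minimal sketch, reusing the same CSV text (the name `indexed` is introduced here only for illustration):

```python
import io

import pandas as pd

input_csv = """index,value
1,2
3,4
"""

# index_col tells read_csv to use the named column as the row index
# instead of generating a default RangeIndex.
indexed = pd.read_csv(io.StringIO(input_csv), index_col="index")

# Rows can now be looked up by the values of the former "index" column.
value_at_3 = indexed.loc[3, "value"]
```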

In [12]:
carshare = px.data.carshare()
carshare.head()

,centroid_lat,centroid_lon,car_hours,peak_hour
0,45.471549,-73.588684,1772.75,2
1,45.543865,-73.562456,986.333333,23
2,45.48764,-73.642767,354.75,20
3,45.52287,-73.595677,560.166667,23
4,45.453971,-73.738946,2836.666667,19


In [12]:
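# Read the Mapbox token from a local file; the with-block closes the file afterwards.
with open(".mapbox_token") as token_file:
    px.set_mapbox_access_token(token_file.read())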
px.scatter_mapbox(px.data.carshare(), lat="centroid_lat", lon="centroid_lon", color="peak_hour", size="car_hours",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
