Spring 2025 <br>
Lecture 08

## Line Plots and Scatterplots

Color References
- List of color names: https://www.w3.org/wiki/CSS/Properties/color/keywords
- List of sequential palettes: https://plotly.com/python/builtin-colorscales/
- List of discrete / categorical palettes: https://plotly.com/python/discrete-color/

Dataset Metadata
- Texas Housing: https://ggplot2.tidyverse.org/reference/txhousing.html
- MSleep (Mammal Sleeping Patterns): https://www.rdocumentation.org/packages/ggplot2/versions/3.5.0/topics/msleep

In [3]:
# Imports

# ! conda install -y statsmodels
import pandas as pd
import plotly.express as px


# Data
df_texas = pd.read_csv("data/txhousing.csv")
display(df_texas.head())

df_msleep = pd.read_csv("data/msleep.csv")
display(df_msleep)

Unnamed: 0,rownames,city,year,month,sales,volume,median,listings,inventory,date
0,1,Abilene,2000,1,72.0,5380000.0,71400.0,701.0,6.3,2000.0
1,2,Abilene,2000,2,98.0,6505000.0,58700.0,746.0,6.6,2000.083333
2,3,Abilene,2000,3,130.0,9285000.0,58100.0,784.0,6.8,2000.166667
3,4,Abilene,2000,4,98.0,9730000.0,68600.0,785.0,6.9,2000.25
4,5,Abilene,2000,5,141.0,10590000.0,67300.0,794.0,6.8,2000.333333


Unnamed: 0,name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
0,Cheetah,Acinonyx,carni,Carnivora,lc,12.1,,,11.9,,50.000
1,Owl monkey,Aotus,omni,Primates,,17.0,1.8,,7.0,0.01550,0.480
2,Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,,9.6,,1.350
3,Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.133333,9.1,0.00029,0.019
4,Cow,Bos,herbi,Artiodactyla,domesticated,4.0,0.7,0.666667,20.0,0.42300,600.000
...,...,...,...,...,...,...,...,...,...,...,...
78,Tree shrew,Tupaia,omni,Scandentia,,8.9,2.6,0.233333,15.1,0.00250,0.104
79,Bottle-nosed dolphin,Tursiops,carni,Cetacea,,5.2,,,18.8,,173.330
80,Genet,Genetta,carni,Carnivora,,6.3,1.3,,17.7,0.01750,2.000
81,Arctic fox,Vulpes,carni,Carnivora,,12.5,,,11.5,0.04450,3.380


## Examples

**Line Plots**
1. Create a line plot of the total number of listings in Texas over time.
2. Create a line plot of the housing prices per city over time.

**Scatter Plots**
3. Create a scatterplot of body weight vs. brain weight.
4. Create a scatterplot of REM sleep vs. total sleep.

### Example 1

- At minimum, a line plot needs a time-based variable (x) and a numeric variable (y)
- Optionally, a third variable can be mapped to color

In [4]:
df_texas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8602 entries, 0 to 8601
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   rownames   8602 non-null   int64  
 1   city       8602 non-null   object 
 2   year       8602 non-null   int64  
 3   month      8602 non-null   int64  
 4   sales      8034 non-null   float64
 5   volume     8034 non-null   float64
 6   median     7986 non-null   float64
 7   listings   7178 non-null   float64
 8   inventory  7135 non-null   float64
 9   date       8602 non-null   float64
dtypes: float64(6), int64(3), object(1)
memory usage: 672.2+ KB


In [7]:
# Create a new variable (year_month)

df_texas['year_month'] = pd.to_datetime(
    df_texas.year.astype(str) + df_texas.month.astype(str),
    format = "%Y%m"
)

df_texas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8602 entries, 0 to 8601
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   rownames    8602 non-null   int64         
 1   city        8602 non-null   object        
 2   year        8602 non-null   int64         
 3   month       8602 non-null   int64         
 4   sales       8034 non-null   float64       
 5   volume      8034 non-null   float64       
 6   median      7986 non-null   float64       
 7   listings    7178 non-null   float64       
 8   inventory   7135 non-null   float64       
 9   date        8602 non-null   float64       
 10  year_month  8602 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(3), object(1)
memory usage: 739.4+ KB
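
The `year_month` column above is built by gluing the year and month together as text. As a hedged aside, here are two other ways to get the same dates; this is a sketch on a tiny made-up frame (`df_dates` is a hypothetical name), not the approach used in lecture. The first zero-pads the month so every string has the form `YYYYMM`; the second hands `year`, `month`, and an assumed `day = 1` column straight to `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df_texas
df_dates = pd.DataFrame({"year": [2000, 2000, 2001], "month": [1, 10, 12]})

# Option 1: zero-pad the month so each string is exactly six characters (YYYYMM)
as_text = df_dates["year"].astype(str) + df_dates["month"].astype(str).str.zfill(2)
opt1 = pd.to_datetime(as_text, format="%Y%m")

# Option 2: pass year/month/day columns directly; day=1 is an assumption
# that places each observation on the first of the month
opt2 = pd.to_datetime(df_dates[["year", "month"]].assign(day=1))

print(opt1.tolist())
print(opt2.tolist())
```

Both options should land on the first day of each month, matching the `year_month` values shown in this notebook.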


In [13]:
# Total number of listings in Texas per Date ("year_month")

df_ex1 = (
    df_texas
    .groupby(['year_month'])
    .agg(
        {
            'listings': 'sum'
        }
    )
    .reset_index()
    .rename(
        columns = {
            'year_month': 'Date',
            'listings': 'Total Active Listings'
        }
    )
)

display(df_ex1.head())

Unnamed: 0,Date,Total Active Listings
0,2000-01-01,75978.0
1,2000-02-01,77071.0
2,2000-03-01,76505.0
3,2000-04-01,79361.0
4,2000-05-01,81906.0
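
One thing to keep in mind with the aggregation above: `df_texas.info()` reports only 7178 non-null `listings` values out of 8602 rows, and `'sum'` skips missing values silently. The sketch below uses a small made-up frame (`df_toy`, with invented city names and values) to show how named aggregation can report the number of cities contributing to each total, and how `min_count=1` keeps an all-missing month as `NaN` instead of `0`. It is an illustration under those assumptions, not a required change to the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two cities, three months, some missing listings
df_toy = pd.DataFrame({
    "year_month": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-02-01",
         "2000-02-01", "2000-03-01", "2000-03-01"]
    ),
    "city": ["A", "B", "A", "B", "A", "B"],
    "listings": [701.0, np.nan, np.nan, np.nan, 784.0, 100.0],
})

# Named aggregation: output column name -> (input column, function)
summary = (
    df_toy
    .groupby("year_month", as_index=False)
    .agg(**{
        "Total Active Listings": ("listings", "sum"),   # NaN values are skipped
        "Cities Reporting": ("listings", "count"),      # non-missing values only
    })
    .rename(columns={"year_month": "Date"})
)
print(summary)

# With min_count=1, a month with no reported listings stays NaN rather than 0
strict_total = df_toy.groupby("year_month")["listings"].sum(min_count=1)
print(strict_total)
```

A "Cities Reporting" column like this makes it easier to judge whether a dip in the line reflects the housing market or simply fewer cities reporting that month.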


In [34]:
fig = px.line(
    df_ex1,
    x = 'Date',
    y = 'Total Active Listings',
    template = 'plotly_white',
    width = 1000,
    height = 700,
    color_discrete_sequence=['steelblue'],

    # Render the markers on the lineplot
    markers=True,

    # Alter the axis ranges
    range_y = [65000, 190000],

    # Titles
    title = '<b>All-time highs in Texas total listings in 2007, 2008, & 2010</b>',
    subtitle = 'In the state of Texas, the total number of active housing listings peaked in July 2007, May 2008, & July 2010.<br>Perhaps this is correlated with periods of decline in the housing market. September 2001 & August 2005<br>had the lowest number of listings.'
)
fig.update_layout(
    font_family = 'Gill Sans',
    title_font_family = 'Copperplate',
    title_font_size = 28,
    margin = {'t': 200},
    # Axis title font size
    yaxis_title_font_size = 18,
    xaxis_title_font_size = 18,
    # Axis tick font size
    yaxis_tickfont_size = 16,
    xaxis_tickfont_size = 16,
    # Title font color
    title_font_color = 'steelblue'
)

### Example 2

In [36]:
px.line(
    df_texas,
    x = 'year_month',
    y = 'median'
)

In [40]:
# Indicator Variable (is_collin_county)

# A boolean comparison gives True for Collin County rows and False for all others
df_texas['is_collin_county'] = df_texas['city'] == 'Collin County'

print(df_texas['is_collin_county'].value_counts())

display(df_texas.head())

is_collin_county
False    8415
True      187
Name: count, dtype: int64


Unnamed: 0,rownames,city,year,month,sales,volume,median,listings,inventory,date,year_month,is_collin_county
0,1,Abilene,2000,1,72.0,5380000.0,71400.0,701.0,6.3,2000.0,2000-01-01,False
1,2,Abilene,2000,2,98.0,6505000.0,58700.0,746.0,6.6,2000.083333,2000-02-01,False
2,3,Abilene,2000,3,130.0,9285000.0,58100.0,784.0,6.8,2000.166667,2000-03-01,False
3,4,Abilene,2000,4,98.0,9730000.0,68600.0,785.0,6.9,2000.25,2000-04-01,False
4,5,Abilene,2000,5,141.0,10590000.0,67300.0,794.0,6.8,2000.333333,2000-05-01,False


In [53]:
fig = px.line(
    df_texas,
    x = 'year_month',
    y = 'median',
    width = 900,
    height = 600,
    # Draw one line for each unique value of this variable
    line_group = 'city',
    color = 'is_collin_county',
    # category_orders={'is_collin_county': [True, False]}
    template='plotly_white',
    color_discrete_sequence=['lightgray', 'tomato'],
    title = '<b>Texan cities finally match Collin County housing prices</b>',
    subtitle = 'While median housing prices in Collin County, Texas were consistently higher from 2000-2006, in recent years<br>other Texan cities have come much closer and even surpassed Collin County in some months.'
)
fig.update_layout(
    showlegend = False,
    font_family = 'Gill Sans',
    title_font_family = 'American Typewriter',
    title_font_size = 24,
    margin = {'t': 200},
    # Axis title font size
    yaxis_title_font_size = 18,
    xaxis_title_font_size = 18,
    # Axis tick font size
    yaxis_tickfont_size = 16,
    xaxis_tickfont_size = 16,
)
fig.update_traces(line_width = 1)
fig.show()

### Example 3

In [62]:
px.scatter(
    df_msleep[df_msleep['brainwt']<1.5],
    x = 'bodywt',
    y = 'brainwt',
    template = 'plotly_white',
    height = 750,
    width = 750,
    hover_name = 'name',
    title = "NOOOOOOoooooOOOOOOOooo",
    subtitle = "TOO MUCH WHITESPACE, DO NOT DO THIS ON THE FINAL PROJECT"
)

### Example 4

In [85]:
df_msleep.loc[df_msleep['sleep_rem']>4, 'flag_REM'] = True
df_msleep.loc[df_msleep['sleep_rem']<=4, 'flag_REM'] = False

fig = px.scatter(
    df_msleep,
    x = 'sleep_rem',
    y = 'sleep_total',
    height = 650,
    width = 650,
    template = 'plotly_white',
    title = '<b>More REM sleep tends to equal more total sleep</b>',
    subtitle = 'Generally, the amount of REM sleep is positively correlated with the amount<br>of total sleep for mammals. In addition, the North American Opossum, Giant<br>Armadillo, and the Thick-tailed Opossum have at least 1 extra hour of REM sleep.',
    trendline = 'ols',
    hover_name = 'name',

    color = 'flag_REM',
    range_y = [0,24],
    trendline_color_override='peru',
    color_discrete_sequence=['crimson', 'darkgrey']
)

fig.update_layout(
    showlegend=False,
    font_family = 'Gill Sans',
    title_font_family = 'American Typewriter',
    title_font_size = 24,
    margin = {'t': 150},
    # Axis title font size
    yaxis_title_font_size = 18,
    xaxis_title_font_size = 18,
    # Axis tick font size
    yaxis_tickfont_size = 16,
    xaxis_tickfont_size = 16,
)
fig.update_traces(
    line_width = 3, 
    line_dash = 'dot',
    marker_size = 8
)
fig.show()
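
The subtitle in Example 4 claims a positive correlation, which can be sanity-checked numerically. The sketch below copies a few rows from the `msleep` preview at the top of this notebook into a small frame (`df_small` is a hypothetical name), then computes the Pearson correlation with pandas and a straight-line fit with `np.polyfit`. Note that `np.polyfit` is a stand-in here: `trendline = 'ols'` in Plotly Express relies on `statsmodels`, which this sketch deliberately avoids.

```python
import numpy as np
import pandas as pd

# Rows copied from the msleep preview; Cheetah has no REM measurement
df_small = pd.DataFrame({
    "name": ["Cheetah", "Owl monkey", "Mountain beaver",
             "Greater short-tailed shrew", "Cow", "Tree shrew", "Genet"],
    "sleep_rem": [np.nan, 1.8, 2.4, 2.3, 0.7, 2.6, 1.3],
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 8.9, 6.3],
})

# Drop rows missing either variable before fitting
complete = df_small.dropna(subset=["sleep_rem", "sleep_total"])

# Pearson correlation between REM sleep and total sleep
r = complete["sleep_rem"].corr(complete["sleep_total"])

# Least-squares line: sleep_total ~ slope * sleep_rem + intercept
slope, intercept = np.polyfit(complete["sleep_rem"], complete["sleep_total"], deg=1)

print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Also worth knowing: when `color` is set, `px.scatter` fits one trendline per color group by default; passing `trendline_scope = 'overall'` fits a single line across all points instead.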