# Lecture 08 Review of Histograms

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv')

In [None]:
this_year = 2023
ages = this_year - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies

# Grouping by categorical variable
We can group our movies by the *categorical* variable 'Studio', to find out how many movies were produced by each studio, using the `group()` function.

We can then plot those counts in a bar plot. Remember: the first parameter is the variable to use on the x-axis, and the second parameter is the variable to use on the y-axis. Check the [`bar()` documentation](https://www.data8.org/datascience/_autosummary/datascience.tables.Table.barh.html#datascience.tables.Table.bar).

Let's make it better:
- Sort the bars in decreasing order
- Only show the 10 largest studios
- Turn it around by 90 degrees
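The three steps above amount to counting, sorting, and truncating. With the Table API this is typically a chain of `group('Studio')`, `sort('count', descending=True)`, `take(np.arange(10))`, and `barh('Studio', 'count')`. The plain NumPy sketch below only mirrors the counting logic; the studio labels are made up, and `n` is a hypothetical stand-in for "the 10 largest".

```python
import numpy as np

# Hypothetical studio labels standing in for top_movies.column('Studio')
studios = np.array(['Fox', 'Disney', 'Fox', 'MGM', 'Disney', 'Fox', 'Universal'])

# Count how often each distinct label occurs (the role group() plays for a Table)
names, counts = np.unique(studios, return_counts=True)

# Sort in decreasing order of count and keep only the n largest
n = 2  # hypothetical; the lecture keeps 10
order = np.argsort(-counts, kind='stable')[:n]
top_names, top_counts = names[order], counts[order]

print(top_names, top_counts)
```

A horizontal bar chart of `top_names` against `top_counts` is the picture the last bullet asks for.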

# Grouping by numerical variable
We want to find out something about the age distribution of the movies, i.e. how many movies fall into each age bracket.

We first calculate the age of all movies (as of 2023).

We then group by age (in years) using the `group()` function.

# Binning
This was a bit boring. We'd much rather create some groups (bins), each of which aggregates a range of ages,
i.e. count all movies that are 0-10 years old, all movies that are 10-20 years old, etc.
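A minimal sketch of this kind of binning, using `np.histogram` rather than a Table method and a handful of made-up ages. It assumes the usual NumPy convention: each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical ages; the real values come from top_movies.column('Age')
ages = np.array([3, 9, 10, 15, 22, 38, 40, 64, 85])

# Uniform 10-year bins: 0, 10, 20, ..., 100
edges = np.arange(0, 101, 10)

# counts[i] is the number of ages in [edges[i], edges[i + 1])
counts, _ = np.histogram(ages, bins=edges)

print(counts)
```

Note that the 10-year-old movie lands in the 10-20 bin, not in the 0-10 bin.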

Now make a bar plot.

# Making a histogram
We are not fully satisfied.
- The movies in the first bar are not all, e.g., 0 years old. Their ages lie between 0 and 10.
- Even worse: if our bins are not uniform, we lose our sense of scale on the x-axis.

In [None]:
my_bins = make_array(0, 10, 20, 40, 65, 102)
...

We can fix this by using the [`hist()` function](http://www.data8.org/datascience/_autosummary/datascience.tables.Table.hist.html#datascience.tables.Table.hist).

## What does the "percent per unit" mean?

Let's inspect our binned data

In [None]:
binned_data

Add a column `'Percent'` containing the percentage of the data in each bin. Hint: use the `sum()` function.
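The arithmetic behind this step can be sketched in plain NumPy. The counts below are hypothetical; the trailing 0 stands in for the extra last row that a binned table is assumed to carry for the final bin edge.

```python
import numpy as np

# Hypothetical bin counts, one per row of the binned table
counts = np.array([40, 60, 50, 30, 20, 0])

# Share of all movies that falls into each bin, as a percentage
percent = counts / counts.sum() * 100

print(percent)
```

The percentages sum to 100 by construction, which makes for a quick sanity check.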

## Bin width
Now we calculate the bin width, i.e. how many years each bin spans. We use the `np.diff()` function.
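As a small illustration, here is `np.diff()` applied to the bin edges used in this lecture. It returns the difference between each pair of neighbouring entries, so the result has one element fewer than the input.

```python
import numpy as np

# The same bin edges as my_bins in this lecture
edges = np.array([0, 10, 20, 40, 65, 102])

# Width of each bin in years: edges[i + 1] - edges[i]
width = np.diff(edges)

print(width)                   # one width per bin
print(len(edges), len(width))  # 6 edges, but only 5 bins
```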

We want to add the widths as a new column to the table. However, the array of widths is one element shorter than the table.
We therefore shorten the table by dropping its last row.

In [None]:
binned_data_ = binned_data.take(np.arange(binned_data.num_rows - 1)) 
binned_data_ = binned_data_.with_column('Width', width)
binned_data_

Now we can calculate the percent per unit, i.e. the percentage in each bin divided by the width of that bin.

In [None]:
percent_per_unit = binned_data_.column('Percent') / binned_data_.column('Width')
percent_per_unit

In [None]:
binned_data_.with_column('percent per unit', percent_per_unit)

In [None]:
top_movies.hist('Age', bins = my_bins, unit='Year')
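One way to sanity-check the percent-per-unit heights is to redo the calculation in plain NumPy. The sketch below uses made-up ages with the lecture's bin edges. It checks that the bar areas (height times width) add up to 100, and that the heights match `np.histogram(..., density=True)` scaled by 100. That `hist()` draws exactly these heights is an assumption here, not something the sketch verifies.

```python
import numpy as np

# Hypothetical ages; the real values come from top_movies.column('Age')
ages = np.array([2, 5, 8, 12, 18, 25, 33, 39, 50, 90])
my_bins = np.array([0, 10, 20, 40, 65, 102])  # same edges as in the lecture

counts, _ = np.histogram(ages, bins=my_bins)
percent = counts / counts.sum() * 100   # percent of movies in each bin
width = np.diff(my_bins)                # years spanned by each bin
percent_per_unit = percent / width      # height of each histogram bar

# The area of each bar recovers the percent in that bin
total_area = (percent_per_unit * width).sum()

# NumPy's density is "proportion per unit", so times 100 gives percent per unit
density, _ = np.histogram(ages, bins=my_bins, density=True)

print(percent_per_unit)
print(total_area)
```

Wide bins get short bars even when they hold many movies, because the same percentage is spread over more years.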

# Discussion question
https://ipm.ucanr.edu/calludt.cgi/WXSTATIONDATA?STN=STBARBRA.C

- University of California Statewide Integrated Pest Management Program
- How to Manage Pests: California Weather Data
- Retrieve data in comma delimited data file format
 
Weather database request:

Time period: January 1, 1993 to January 1, 2023 (10958 days), retrieved on May 2, 2023

|Variable  | Description                  |Units                         |
|:--       | :--                          | :--                          |
|   1      | Database name                |                              |
|   2      | Date: year,month,day         |yyyymmdd                      |
|   3      | Observation time             |hhmm                          |
|   4      | Precipitation, amount        |Millimeters                   |
|   5      | Precipitation, type          |(coded)                       |
|   6      | Air temperature, maximum     |Celsius                       |
|   7      | Air temperature, minimum     |Celsius                       |
|   8      | Air temperature, observed    |Celsius                       |
|   9      | Weather conditions           |(coded)                       |
|  10      | Wind, direction              |N,NE,E,SE,S,SW,W,NW, 0=calm   |
|  11      | Wind, speed                  |Meters per second             |
|  12      | Bulb temperature, wet        |Celsius                       |
|  13      | Bulb temperature, dry        |Celsius                       |
|  14      | Soil temperature, maximum    |Celsius                       |
|  15      | Soil temperature, minimum    |Celsius                       |
|  16      | Pan evaporation              |Millimeters                   |
|  17      | Solar radiation              |Watts per sq. meter           |
|  18      | Reference evapotranspiration |Millimeters                   |
|  19      | Relative humidity, minimum   |Percent                       |
|  20      | Relative humidity, maximum   |Percent                       |



## Weather Type
Weather type contains information about the weather at the observation time. Also, if a significant weather event occurs during a day, hail or a tornado for instance, the occurrence may be noted in this field.

|Code  | Meaning    |
| :--   | :--|
|C	|Clear	                                   |
|R	|Rain                                      |
|PC	|Partly cloudy	                           |
|R+	|Heavy rain                                |
|CY	|Cloudy	                                   |
|W	|Rain showers                              |
|HZ	|Haze or smoke	                           |
|W+	|Heavy showers                             |
|DS	|Dust storm	                               |
|S	|Snow                                      |
|F	|Fog	                                   |
|S+	|Heavy snow                                |
|F+	|Heavy fog	                               |
|BS	|Blowing snow                              |
|DZ	|Drizzle	                               |
|IP	|Sleet                                     |
|TH	|Thunderstorm	                           |
|HL	|Hail                                      |
|TO	|Tornado	                               |
|SR	|Snow and rain mixed                       |
|T	|Thunder, no rain	                       |
|HW	|High winds                                |
|L	|Lightning, no thunder                     |
|DW	|Dew present                               |

In [None]:
weather = Table.read_table('data/sb_weather2.csv')
weather

In [None]:
(weather
    .group('Wx')
    .barh('Wx', 'count')
)

In [None]:
weather.hist('Air max')

In [None]:
weather.scatter('Air max', 'min')