# Variance in Weather
You’re planning a trip to London and want to get a sense of the best time of the year to visit. Luckily, you got your hands on a dataset from 2015 that contains over 39,000 data points about weather conditions in London. Surely, with this much information, you can discover something useful about when to make your trip!

In this project, the data is stored in a Pandas DataFrame. If you’ve never used a DataFrame before, we’ll walk you through how to filter and manipulate this data. If you want to learn more about Pandas, check out our [Data Science Path](https://www.codecademy.com/learn/paths/data-science).

## Explore the Data
1. All of the weather data is stored in a variable named `london_data`.

    Print the first few rows of the dataset by calling `print(london_data.head())`.

    Take a look at the browser to see the columns of this dataset. Here are two questions to ask yourself:

    - How often were measurements taken?
    - Which columns might be the most useful when thinking about planning a trip?

    If you want to see different rows of the data, you can try something like this:

    ```py
    print(london_data.iloc[100:200])
    ```

    This will print rows 100 through 199.

    Comment out these print statements after looking through the results.


#### Hint
By looking at the `"Time"` column, you can see that a data sample is taken three or four times every hour.

The `"TemperatureC"` column might be a good place to start when thinking about planning a trip to London.
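
One way to check the sampling frequency in code, rather than by eye, is to parse the `"Time"` column and look at the gaps between consecutive rows. The sketch below uses only the five timestamps shown by `head()`; the names `sample`, `times`, and `gaps` are illustrative, and the same steps should carry over to `london_data["Time"]`.

```py
import pandas as pd

# The five timestamps shown in the head() output of the "Time" column
sample = pd.DataFrame({
    "Time": [
        "2015-01-01 00:00:00",
        "2015-01-01 00:12:00",
        "2015-01-01 00:27:00",
        "2015-01-01 00:42:00",
        "2015-01-01 00:57:00",
    ]
})

# Parse the strings into timestamps, then take row-to-row differences
times = pd.to_datetime(sample["Time"])
gaps = times.diff().dropna()

print(gaps.dt.total_seconds() / 60)  # gap between samples, in minutes
print(gaps.median())                 # typical gap between samples
```

On the full dataset, `gaps.describe()` would summarize how regular the sampling is across the whole year.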

In [1]:
import seaborn
import pandas as pd
import numpy as np

# load csv file
london_data = pd.read_csv('weather_london.csv')

# Explore the Data
# take a look at london_data dataset
print(london_data.head())
print(london_data.iloc[100:200])

# data points size
print(london_data.shape)


                  Time  TemperatureC  DewpointC  PressurehPa WindDirection  \
0  2015-01-01 00:00:00           4.6        2.9       1031.7          West   
1  2015-01-01 00:12:00           4.5        2.8       1031.4           WNW   
2  2015-01-01 00:27:00           4.5        2.8       1031.0            SW   
3  2015-01-01 00:42:00           4.8        3.2       1031.7          West   
4  2015-01-01 00:57:00           5.2        3.5       1031.4            NW   

   WindDirectionDegrees  WindSpeedKMH  WindSpeedGustKMH  Humidity  \
0                   273           0.0               1.6        89   
1                   291           0.0               1.6        89   
2                   229           0.0               1.6        89   
3                   281           0.0               4.8        89   
4                   309           0.0               1.6        89   

   HourlyPrecipMM Conditions Clouds  dailyrainMM         SoftwareType  \
0             0.0        nan    nan        

2. Let’s also take a look at how many data points we have. Print `len(london_data)`.

In [2]:
print(len(london_data))

39106


## Looking At Temperature
3. Now that we’ve seen what the data looks like, let’s dive into one of the more promising columns — `"TemperatureC"`. This column stores the temperature in Celsius.

    To get a single column from a DataFrame, you can use this syntax:

    one_column = london_data["column_name"]

    Create a variable named `temp` and set it equal to the `"TemperatureC"` column of `london_data`.

In [3]:
# Looking At Temperature
temp = london_data['TemperatureC']
# print(temp.min(), temp.max())
print(np.min(temp), np.max(temp))

-3.8 35.8


4. We can now calculate descriptive statistics about this column. To begin, find the average temperature in London in 2015. Store it in a variable named `average_temp`.

In [4]:
# average temperatures
average_temp = np.mean(temp)
print(average_temp)  # avg temp is 12.08 C

12.081969518743925


5. Calculate the variance of the temperature column and store the results in the variable `temperature_var`. Print the results.

In [5]:
# variance of temp
temperature_var = np.var(temp)
print(temperature_var)  # variance is 29.72

29.715642528199353


6. Calculate the standard deviation of the temperature column and store it in a variable named `temperature_standard_deviation`. Print this variable.

    How would the variance and standard deviation help you plan a trip?

In [6]:
# standard deviation of temp
temperature_standard_deviation = np.std(temp)
print(temperature_standard_deviation)  # std dev is 5.45

5.4512056031853495


## Filtering By Month
7. The statistics we just calculated aren’t very helpful when trying to plan a vacation since they describe the weather throughout an entire year.

    If we could find a way to use the rows from only a certain month, that might help us find the best month to plan our trip.

    Once again, print `london_data.head()` to see the first few rows of our DataFrame. Which column will help us get only the data points from January? In the browser you can scroll to the right to see more columns.


#### Hint
Take a look at the `"month"` column. The first few rows contain `1`s, meaning those data points came from January.

If you look at `london_data.tail()` you’ll see the data points at the end of the dataset all have `12`s, meaning those data points came from December.

In [7]:
# Filter by month
print(london_data.month.head())
print(london_data.month.tail())

0    1
1    1
2    1
3    1
4    1
Name: month, dtype: int64
39101    12
39102    12
39103    12
39104    12
39105    12
Name: month, dtype: int64


8. We want to filter by the `"month"` column! The following line of code will create a variable that gets the temperature from the rows where `"month"` is `6`. These will be all of the rows from the month of June.

    ```py
    june = london_data.loc[london_data["month"] == 6]["TemperatureC"]
    ```
    Create this variable for June.

In [8]:
# june temp
june = london_data.loc[london_data["month"] == 6]["TemperatureC"]
print(june)

17469     8.9
17470     8.8
17471     8.6
17472     8.2
17473     8.0
         ... 
20655    20.2
20656    20.0
20657    19.9
20658    19.9
20659    19.8
Name: TemperatureC, Length: 3191, dtype: float64
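
The filter works in two steps: the comparison `london_data["month"] == 6` produces a Series of `True`/`False` values, and `.loc` keeps only the rows marked `True`. The sketch below shows this on a tiny made-up DataFrame (`df` and its values are hypothetical) and also shows that `.loc` can select the rows and the column in a single call.

```py
import pandas as pd

# Tiny hypothetical stand-in for london_data (values are made up)
df = pd.DataFrame({
    "month": [5, 6, 6, 7],
    "TemperatureC": [13.0, 17.5, 16.5, 19.0],
})

is_june = df["month"] == 6                  # boolean Series, one entry per row
chained = df.loc[is_june]["TemperatureC"]   # filter rows, then pick the column
single = df.loc[is_june, "TemperatureC"]    # rows and column in one .loc call

print(is_june.tolist())
print(single.tolist())
print(chained.equals(single))
```

Both forms return the same data here; the single-call form avoids creating an intermediate DataFrame.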


9. Create a variable named `july` that contains all of the data points from July. The code to do this should look very similar to your code that created the June variable. This time, we’re interested in month `7`.

In [9]:
# july temp
july = london_data.loc[london_data["month"] == 7]["TemperatureC"]
print(july)

20660    19.9
20661    19.9
20662    20.0
20663    20.1
20664    20.0
         ... 
23529    13.8
23530    13.7
23531    13.3
23532    12.8
23533    12.4
Name: TemperatureC, Length: 2874, dtype: float64


10. Calculate and print the mean temperature in London for both June and July using the `np.mean()` function.

    What do these numbers tell you? If you wanted to visit London in the month that was, on average, cooler, which month would you pick? Look at the hint to see our thoughts!

In [10]:
# avg temp for june and july
print(np.mean(june))  # june avg temp: 17.05
print(np.mean(july))  # july avg temp: 18.78

17.04728925101849
18.775608907446067


11. Calculate and print the standard deviation of temperature in London for both June and July. Remember, the function you should use is `np.std()`.

    What do these numbers tell you? How might the standard deviation change your decision on when to visit London? Click on the hint to see our thoughts.

In [11]:
# std dev june and july
print(np.std(june))  # std dev june: 4.60
print(np.std(july))  # std dev july: 4.14

4.597909204651791
4.136377318662126
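
One rough way to read these numbers together is to look at a band of two standard deviations on either side of the mean. As a rule of thumb, most readings fall inside that band when the data is roughly bell-shaped, which is an assumption here and may not hold for temperature data. The sketch below computes the band from the rounded values printed above; `stats` and `ranges` are illustrative names.

```py
# Means and standard deviations printed above, rounded to two decimals
stats = {
    "June": (17.05, 4.60),
    "July": (18.78, 4.14),
}

ranges = {}
for name, (mean, std) in stats.items():
    # Band of two standard deviations either side of the mean
    ranges[name] = (round(mean - 2 * std, 2), round(mean + 2 * std, 2))
    print(name, ranges[name])
```

Under this reading, July is warmer on average and its band is narrower than June's.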


12. If you want to quickly see the mean and standard deviation of every month, use this block of code.

  ```py
  for i in range(1, 13):
    month = london_data.loc[london_data["month"] == i]["TemperatureC"]
    print("The mean temperature in month "+str(i) +" is "+ str(np.mean(month)))
    print("The standard deviation of temperature in month "+str(i) +" is "+ str(np.std(month)) +"\n")
  ```

  During which month would you most like to visit? If you wanted to pick the month with the least variable temperature, which one would you pick?


#### Hint
To us, September (month `9`) looks like a nice month to visit. The temperatures aren’t too high on average, and there isn’t a lot of variability.

The month with the lowest variability is December (month `12`) with a standard deviation of `2.35`. This means that even though the average is a cold `11.2` degrees Celsius, it rarely gets more than 4 or 5 degrees away from that average.
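
Rather than scanning twelve printed values by eye, the least variable month can also be found in code. The sketch below is one possible approach on a made-up miniature dataset (`df` and its values are hypothetical): it computes the standard deviation per month with `groupby`, passing `ddof=0` so the result matches the default of `np.std`, then uses `idxmin()` to return the month with the smallest value.

```py
import pandas as pd

# Hypothetical miniature dataset: three months, four readings each (made-up values)
df = pd.DataFrame({
    "month": [1] * 4 + [2] * 4 + [3] * 4,
    "TemperatureC": [2.0, 6.0, 4.0, 8.0,    # month 1: widely spread
                     5.0, 5.5, 4.5, 5.0,    # month 2: tightly clustered
                     7.0, 9.0, 8.0, 10.0],  # month 3: in between
})

# Population standard deviation (ddof=0) per month, matching np.std's default
monthly_std = df.groupby("month")["TemperatureC"].std(ddof=0)

print(monthly_std)
print("Least variable month:", monthly_std.idxmin())
```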

In [12]:
# month average temp and std deviation
for i in range(1, 13):
    month = london_data.loc[london_data["month"] == i]["TemperatureC"]
    print("The mean temperature in month " + str(i) + " is " + str(np.mean(month)))
    print("The standard deviation of temperature in month " + str(i) + " is " + str(np.std(month)) + "\n")

The mean temperature in month 1 is 5.663950927964769
The standard deviation of temperature in month 1 is 3.5630186427365724

The mean temperature in month 2 is 5.089492367767128
The standard deviation of temperature in month 2 is 2.6533914088938397

The mean temperature in month 3 is 7.935419026047565
The standard deviation of temperature in month 3 is 2.7389280396346902

The mean temperature in month 4 is 10.553368744758178
The standard deviation of temperature in month 4 is 4.081741047920952

The mean temperature in month 5 is 13.597456461961503
The standard deviation of temperature in month 5 is 3.521941235413068

The mean temperature in month 6 is 17.04728925101849
The standard deviation of temperature in month 6 is 4.597909204651791

The mean temperature in month 7 is 18.775608907446067
The standard deviation of temperature in month 7 is 4.136377318662126

The mean temperature in month 8 is 18.002258355916894
The standard deviation of temperature in month 8 is 3.468858883252718

T

## Explore on Your Own
13. By looking at the mean and standard deviation of the temperature in London during each month of the year, we can get a sense of the best time to visit.

    The spread of the data is important to consider if you are particularly sensitive to extreme days. For example, if you pick a month with a large standard deviation, you might have one day that is relatively cold while the following day is very hot.

    Take some time to see if you can find more insights in this dataset. Here are some ideas we have for you:

    - Look at columns other than `"TemperatureC"`. Can you find something interesting about the humidity or the air pressure? Can you find the rainiest month? London is notoriously rainy!
    - Filter based on `"hour"`. Similar to how you filtered based on the month, are there certain hours that have higher variance than others?
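
For the second idea, one possible starting point is to compute the temperature variance within each hour and sort the result so the most variable hours come first. The sketch below uses a made-up miniature dataset (`df` and its values are hypothetical); on the real data the same call would be applied to `london_data`.

```py
import pandas as pd

# Hypothetical miniature dataset (made-up values): two readings at each of three hours
df = pd.DataFrame({
    "hour": [4, 4, 14, 14, 22, 22],
    "TemperatureC": [9.0, 10.0, 12.0, 20.0, 11.0, 14.0],
})

# Population variance (ddof=0) of temperature within each hour, largest first
hourly_var = (
    df.groupby("hour")["TemperatureC"]
    .var(ddof=0)
    .sort_values(ascending=False)
)

print(hourly_var)
```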

In [13]:
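# Explore on Your Own
# average of dailyrainMM by hour
# note: in the output below the values climb steadily through the day, which suggests
# dailyrainMM is a running daily total rather than rainfall within each hour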
for i in range(0, 24):
    hour = london_data.loc[london_data["hour"] == i]["dailyrainMM"]
    print("The mean precipitation in hour "+str(i) + " is " + str(np.mean(hour)))

# avg rain by hour
avg_rain_hour = london_data.groupby(['hour'])['dailyrainMM'].mean()
print(avg_rain_hour)


# avg temp by month using groupby
avg_temp_month = london_data.groupby(['month'])['TemperatureC'].mean().round(2)
print("Avg Temp by Month:", avg_temp_month.T)

# standard deviation of temp by month using groupby
std_temp_month = london_data.groupby(['month'])['TemperatureC'].apply(np.std).round(2)
print("Standard Deviation Temp by Month:", std_temp_month.T)

# Combine avg_temp_month and std_temp_month into a single DataFrame
combined_df = pd.concat([avg_temp_month, std_temp_month], axis=1)
combined_df.columns = ['Avg Temperature (C)', 'Standard Deviation']

print(combined_df.T)

# avg wind speed (kmh) by month
avg_windspkmh_month = london_data.groupby(['month'])['WindSpeedKMH'].mean()
# avg humidity by month
avg_humidity_month = london_data.groupby(['month'])['Humidity'].mean()
# avg daily rain by month
avg_dailyrain_month = london_data.groupby(['month'])['dailyrainMM'].mean()

combined_df['Wind Speed (KMH)'] = avg_windspkmh_month.round(2)
combined_df['Humidity'] = avg_humidity_month.round(2)
combined_df['Daily Rain (MM)'] = avg_dailyrain_month.round(2)

print(combined_df.T)

The mean precipitation in hour 0 is 0.025401730531520398
The mean precipitation in hour 1 is 0.058186244674376136
The mean precipitation in hour 2 is 0.0852439024390244
The mean precipitation in hour 3 is 0.125
The mean precipitation in hour 4 is 0.16833740831295843
The mean precipitation in hour 5 is 0.2240684178375076
The mean precipitation in hour 6 is 0.28304562268803946
The mean precipitation in hour 7 is 0.3366402921485088
The mean precipitation in hour 8 is 0.42326580724370777
The mean precipitation in hour 9 is 0.47336139506915215
The mean precipitation in hour 10 is 0.5746804625684724
The mean precipitation in hour 11 is 0.6056558363417569
The mean precipitation in hour 12 is 0.6881539386650632
The mean precipitation in hour 13 is 0.7495747266099635
The mean precipitation in hour 14 is 0.8311814859926919
The mean precipitation in hour 15 is 0.9056755089450957
The mean precipitation in hour 16 is 1.0322942643391522
The mean precipitation in hour 17 is 1.1031269543464666
The mea

In [35]:
# Pivot Table Temperatures by Month-Hour
# avg temp by month and hour
avg_temp_month_hour = london_data.groupby(['month', 'hour'])['TemperatureC'].mean().round(2)
# Reset index to convert the multi-index into columns
avg_temp_month_hour = avg_temp_month_hour.reset_index()
# Pivot the DataFrame to display rows by month and columns by hour
pivot_temp = avg_temp_month_hour.pivot(index='month', columns='hour', values='TemperatureC')
# print(pivot_temp.T)
# Rename the index to display abbreviated month names
month_names = {1: 'JAN', 2: 'FEB', 3: 'MAR', 4: 'APR', 5: 'MAY', 6: 'JUN', 7: 'JUL', 8: 'AUG', 9: 'SEP', 10: 'OCT',
               11: 'NOV', 12: 'DEC'}
pivot_temp = pivot_temp.rename(index=month_names)
# Add a row with the average of each hour across all months
pivot_temp.loc['Tot_Avg_Hour'] = pivot_temp.mean(axis=0).round(2)
# Each hour's share of the summed hourly averages (computed but not used below)
mean_row = pivot_temp.loc['Tot_Avg_Hour']
percent_row = (mean_row / mean_row.sum())
# Add a column with the average of each month across all hours
pivot_temp['Tot_Avg_Month'] = pivot_temp.mean(axis=1).round(2)

# max value by column total
print('Max hour value observed:', pivot_temp.loc['Tot_Avg_Hour'].idxmax())  # hour 14 has the highest average temperature

pivot_temp.T


Max hour value observed: 14


| hour | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | Tot_Avg_Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.09 | 3.94 | 6.76 | 7.75 | 11.05 | 13.54 | 16.14 | 15.57 | 11.95 | 10.94 | 10.05 | 10.87 | 10.3 |
| 1 | 5.06 | 3.9 | 6.55 | 7.52 | 10.41 | 12.97 | 15.45 | 15.21 | 11.59 | 10.87 | 9.87 | 10.66 | 10.01 |
| 2 | 4.95 | 3.63 | 6.3 | 7.11 | 10.0 | 12.59 | 14.95 | 14.96 | 11.36 | 10.58 | 9.69 | 10.64 | 9.73 |
| 3 | 5.0 | 3.52 | 5.99 | 6.81 | 9.78 | 12.28 | 14.7 | 14.78 | 11.07 | 10.45 | 9.75 | 10.63 | 9.56 |
| 4 | 4.92 | 3.52 | 5.67 | 6.56 | 9.7 | 12.07 | 14.64 | 14.72 | 10.93 | 10.25 | 9.61 | 10.6 | 9.43 |
| 5 | 4.92 | 3.53 | 5.53 | 6.39 | 10.14 | 12.84 | 15.19 | 14.79 | 10.93 | 10.22 | 9.61 | 10.6 | 9.56 |
| 6 | 4.98 | 3.43 | 5.39 | 6.95 | 11.3 | 14.79 | 16.73 | 15.67 | 11.16 | 9.92 | 9.62 | 10.64 | 10.05 |
| 7 | 4.96 | 3.59 | 5.75 | 8.46 | 12.64 | 16.35 | 18.13 | 16.96 | 12.21 | 10.12 | 9.64 | 10.65 | 10.79 |
| 8 | 4.85 | 4.15 | 6.74 | 10.07 | 13.87 | 17.25 | 18.88 | 18.2 | 13.62 | 10.96 | 9.98 | 10.7 | 11.61 |
| 9 | 5.29 | 5.08 | 7.64 | 11.42 | 14.73 | 18.44 | 19.82 | 19.51 | 14.92 | 12.05 | 10.69 | 11.14 | 12.56 |
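
The group, reset, and pivot sequence used in the cell above can usually be collapsed into a single `pivot_table` call, which groups and aggregates in one step. The sketch below compares the two routes on a made-up miniature dataset (`df` and its values are hypothetical); it is offered as an alternative, not as a correction of the cell above.

```py
import pandas as pd

# Hypothetical miniature dataset (made-up values)
df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "hour": [0, 0, 12, 0, 12, 12],
    "TemperatureC": [4.0, 6.0, 8.0, 3.0, 7.0, 9.0],
})

# Long route: group, average, reset the index, then pivot
long_way = (
    df.groupby(["month", "hour"])["TemperatureC"]
    .mean()
    .reset_index()
    .pivot(index="month", columns="hour", values="TemperatureC")
)

# Short route: pivot_table groups and averages in one call
short_way = df.pivot_table(
    index="month", columns="hour", values="TemperatureC", aggfunc="mean"
)

print(short_way)
print(long_way.equals(short_way))
```

`pivot_table` also accepts `margins=True` to append overall averages, though those are computed from the underlying rows rather than as a mean of the cell means, so they may differ slightly from the `Tot_Avg_Hour` and `Tot_Avg_Month` values built by hand.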


In [None]:
print(pivot_temp.loc['Tot_Avg_Hour'].idxmax())  # hour 14 has the highest average temperature


In [39]:
# Pivot Table Precipitation (MM) by Month-Hour
# avg dailyrainMM by month and hour
avg_rain_month_hour = london_data.groupby(['month', 'hour'])['dailyrainMM'].mean().round(2)
# Reset index to convert the multi-index into columns
avg_rain_month_hour = avg_rain_month_hour.reset_index()
# Pivot the DataFrame to display rows by month and columns by hour
pivot_rain = avg_rain_month_hour.pivot(index='month', columns='hour', values='dailyrainMM')
# print(pivot_rain.T)
# Rename the index to display abbreviated month names
month_names = {1: 'JAN', 2: 'FEB', 3: 'MAR', 4: 'APR', 5: 'MAY', 6: 'JUN', 7: 'JUL', 8: 'AUG', 9: 'SEP', 10: 'OCT',
               11: 'NOV', 12: 'DEC'}
pivot_rain = pivot_rain.rename(index=month_names)
# Add a row for the total average of each hour
pivot_rain.loc['Tot_Avg_Hour'] = pivot_rain.mean(axis=0).round(2)
# Add a column for the total average of each month
pivot_rain['Tot_Avg_Month'] = pivot_rain.mean(axis=1).round(2)

# max value by column total
print('Max hour value observed:', pivot_rain.loc['Tot_Avg_Hour'].idxmax())  # hour 22 has the highest average dailyrainMM

pivot_rain.T


Max hour value observed: 22


| hour | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | Tot_Avg_Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.08 | 0.01 | 0.03 | 0.02 | 0.02 | 0.0 | 0.02 | 0.09 | 0.02 |
| 1 | 0.01 | 0.04 | 0.01 | 0.04 | 0.09 | 0.02 | 0.14 | 0.03 | 0.07 | 0.0 | 0.03 | 0.22 | 0.06 |
| 2 | 0.04 | 0.05 | 0.01 | 0.13 | 0.1 | 0.04 | 0.17 | 0.06 | 0.1 | 0.01 | 0.07 | 0.28 | 0.09 |
| 3 | 0.1 | 0.05 | 0.02 | 0.2 | 0.13 | 0.11 | 0.18 | 0.09 | 0.17 | 0.0 | 0.14 | 0.33 | 0.13 |
| 4 | 0.15 | 0.05 | 0.03 | 0.22 | 0.15 | 0.22 | 0.16 | 0.1 | 0.32 | 0.04 | 0.2 | 0.38 | 0.17 |
| 5 | 0.26 | 0.05 | 0.08 | 0.21 | 0.22 | 0.26 | 0.21 | 0.12 | 0.48 | 0.1 | 0.31 | 0.4 | 0.22 |
| 6 | 0.37 | 0.05 | 0.13 | 0.19 | 0.23 | 0.27 | 0.25 | 0.19 | 0.59 | 0.23 | 0.53 | 0.45 | 0.29 |
| 7 | 0.47 | 0.04 | 0.2 | 0.22 | 0.23 | 0.3 | 0.3 | 0.24 | 0.65 | 0.37 | 0.6 | 0.48 | 0.34 |
| 8 | 0.63 | 0.04 | 0.23 | 0.31 | 0.25 | 0.31 | 0.37 | 0.33 | 0.78 | 0.49 | 0.89 | 0.51 | 0.43 |
| 9 | 0.75 | 0.04 | 0.22 | 0.35 | 0.23 | 0.32 | 0.46 | 0.61 | 0.9 | 0.44 | 1.04 | 0.52 | 0.49 |
