# Pandas Exercises
## Basics

We'll go through the basics of pandas using a year's worth of weather data 
from Weather Underground.

In [4]:
import pandas

data = pandas.read_csv("https://raw.githubusercontent.com/evdoks/" \
                       "data_science/master/data/weather_year.csv")

Get some information about the DataFrame

Let's explore the content of the DataFrame

In [5]:
print("Number of rows: ")
print(len(data))

print("\nColumns: ")
print(data.columns)

Number of rows: 
366

Columns: 
Index(['EDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn',
       ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn',
       ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles',
       ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH',
       'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees'],
      dtype='object')


In [7]:
# access single column
print("EDT column:")
print(data["EDT"].head())  # data.EDT will also work

EDT column:
0    2012-3-10
1    2012-3-11
2    2012-3-12
3    2012-3-13
4    2012-3-14
Name: EDT, dtype: object


In [8]:
# access several columns
print("\nEDT and Mean TemperatureF columns:")
print(data[["EDT", "Mean TemperatureF"]].head())


EDT and Mean TemperatureF columns:
         EDT  Mean TemperatureF
0  2012-3-10                 40
1  2012-3-11                 49
2  2012-3-12                 62
3  2012-3-13                 63
4  2012-3-14                 62


### Exercise 1

How would we get the second to last date (EDT) in the dataset?

In [5]:
# Combine head() and tail()
last_two_dates = data.EDT.tail(2)
second_to_last_date = last_two_dates.head(1)

print(second_to_last_date)

364    2013-3-9
Name: EDT, dtype: object
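
As an alternative sketch, positional indexing with `iloc` reaches the same element directly. The small Series below is a made-up stand-in for `data.EDT`, used only for illustration.

```python
import pandas

# Hypothetical stand-in for data.EDT; the real column has 366 entries.
dates = pandas.Series(["2013-3-8", "2013-3-9", "2013-3-10"], name="EDT")

# Negative positions count from the end, so -2 is the second to last element.
second_to_last = dates.iloc[-2]
print(second_to_last)
```

Unlike the `tail(2).head(1)` combination, which returns a one-element Series, `iloc[-2]` returns the bare value.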


## Working with columns

Rename individual columns with the `rename()` method of 
the DataFrame.

In [6]:
data = data.rename(columns={ "Max TemperatureF": "max_temp", 
                            "Min TemperatureF": "min_temp" })
print(data.columns)

Index(['EDT', 'max_temp', 'Mean TemperatureF', 'min_temp', 'Max Dew PointF',
       'MeanDew PointF', 'Min DewpointF', 'Max Humidity', ' Mean Humidity',
       ' Min Humidity', ' Max Sea Level PressureIn',
       ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn',
       ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles',
       ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH',
       'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees'],
      dtype='object')


We can also rename all of our columns at the same time. 
This is as easy as assigning a new list of column names to the `columns` 
property of the DataFrame.

In [7]:
data.columns = ["date", "max_temp", "mean_temp", "min_temp", "max_dew",
                "mean_dew", "min_dew", "max_humidity", "mean_humidity",
                "min_humidity", "max_pressure", "mean_pressure",
                "min_pressure", "max_visibilty", "mean_visibility",
                "min_visibility", "max_wind", "mean_wind", "min_wind",
                "precipitation", "cloud_cover", "events", "wind_dir"]
data.head()

,date,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
0,2012-3-10,56,40,24,24,20,16,74,50,26,...,10,10,10,13,6,17.0,0.00,0,,138
1,2012-3-11,67,49,30,43,31,24,78,53,28,...,10,10,10,22,7,32.0,T,1,Rain,163
2,2012-3-12,71,62,53,59,55,43,90,76,61,...,10,10,6,24,14,36.0,0.03,6,Rain,190
3,2012-3-13,76,63,50,57,53,47,93,66,38,...,10,10,4,16,5,24.0,0.00,0,,242
4,2012-3-14,80,62,44,58,52,43,93,68,42,...,10,10,10,16,6,22.0,0.00,0,,202
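
When a file has many columns, typing every new name by hand can be tedious. The following is a minimal sketch of a programmatic cleanup using the string methods available on `columns`. The frame `raw` and its three headers are hypothetical, and the result only removes stray spaces and capitals rather than reproducing the short names chosen above.

```python
import pandas

# Hypothetical frame with the same kind of messy headers as the raw CSV.
raw = pandas.DataFrame(columns=["EDT", " Mean Humidity", " CloudCover"])

# Strip surrounding whitespace, lowercase, and replace inner spaces.
raw.columns = (
    raw.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
print(list(raw.columns))
```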


Now all columns can be accessed with dot notation!

Mean temperature

In [8]:
data.mean_temp.head()

0    40
1    49
2    62
3    63
4    62
Name: mean_temp, dtype: int64

Standard deviation for mean temperature

In [9]:
data.mean_temp.std()

18.43650599625107

Standard deviation for all numeric columns

In [10]:
data.std(numeric_only=True)  # skip text columns such as date and events

max_temp           20.361247
mean_temp          18.436506
min_temp           17.301141
max_dew            16.397178
mean_dew           16.829996
min_dew            17.479449
max_humidity        9.108438
mean_humidity       9.945591
min_humidity       15.360261
max_pressure        0.172189
mean_pressure       0.174112
min_pressure        0.182476
max_visibilty       0.073821
mean_visibility     1.875406
min_visibility      3.792219
max_wind            5.564329
mean_wind           3.200940
min_wind            8.131092
cloud_cover         2.707261
wind_dir           94.045080
dtype: float64

Standard deviation for a subset of columns

In [11]:
data[["max_temp", "max_pressure", "max_wind"]].std()

max_temp        20.361247
max_pressure     0.172189
max_wind         5.564329
dtype: float64
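
Several statistics can also be computed in one call with `agg()`. This is a small sketch on made-up numbers, not on the weather data itself.

```python
import pandas

# Made-up numbers standing in for a few weather columns.
sample = pandas.DataFrame({
    "max_temp": [56, 67, 71, 76],
    "max_wind": [13, 22, 24, 16],
})

# agg() accepts a list of reduction names and returns one row per statistic.
summary = sample.agg(["min", "max", "std"])
print(summary)
```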

### Exercise 2

What is the range of temperatures in the dataset?

**Hint:** columns have `max()` and `min()` methods.

In [12]:
hottest_temp = data.max_temp.max()  # Highest of the highs
coldest_temp = data.min_temp.min()  # Lowest of the lows
print("Temperature range:", hottest_temp - coldest_temp, "degrees F")

Temperature range: 105 degrees F


## Bulk Operations with `apply()`

Built-in methods like `sum()` and `std()` work on entire columns. 
We can run our own functions across all values in a column (or row) 
using `apply()`.

Let's convert the strings in the `date` column (originally `EDT`) to 
`datetime.datetime` objects.

First, let's examine the string date representations:

In [13]:
first_date = data.date.values[0]
print(first_date, "is a", type(first_date))

2012-3-10 is a <class 'str'>


The `strptime()` method of the `datetime` class parses a date string into a `datetime` object.

In [14]:
# Import the datetime class from the datetime module
from datetime import datetime

# Convert date string to datetime object
datetime.strptime(first_date, "%Y-%m-%d")

datetime.datetime(2012, 3, 10, 0, 0)

Apply our conversion procedure to all values of the `data.date` column.

In [15]:
def string_to_date(date_string):
    # Convert a "YYYY-M-D" string to a datetime object
    return datetime.strptime(date_string, "%Y-%m-%d")

# Run the function on every date string and store the result in a new column
data['date_1'] = data.date.apply(string_to_date)
print(data.date.head())  # the original string column is left unchanged

0    2012-3-10
1    2012-3-11
2    2012-3-12
3    2012-3-13
4    2012-3-14
Name: date, dtype: object
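
For comparison, pandas also offers `pandas.to_datetime`, which converts a whole column in one call. The sketch below uses three made-up date strings in the same style as the dataset and checks that both approaches agree.

```python
import pandas
from datetime import datetime

# A few date strings in the same non-zero-padded style as the dataset.
dates = pandas.Series(["2012-3-10", "2012-3-11", "2012-12-1"])

# Element-wise conversion with apply(), as in the cell above.
with_apply = dates.apply(lambda s: datetime.strptime(s, "%Y-%m-%d"))

# Conversion of the whole column in a single call.
with_to_datetime = pandas.to_datetime(dates, format="%Y-%m-%d")

print((with_to_datetime == with_apply).all())
```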


### Exercise 3

Perform the same conversion from a string to a `datetime` using a `lambda`.

In [16]:
data['date_2'] = data.date.apply(lambda s: datetime.strptime(s, "%Y-%m-%d"))
print(data.date.head())  # the original string column is left unchanged

0    2012-3-10
1    2012-3-11
2    2012-3-12
3    2012-3-13
4    2012-3-14
Name: date, dtype: object


## Indexes

Each row in our DataFrame represents the weather from a single day. 
Each row in a DataFrame is associated with an index, which is a label 
that identifies the row.

Our row indices up to now have been auto-generated by pandas, and are simply 
integers from 0 to 365. If we use dates instead of integers for our index, 
we will get some extra benefits from pandas when plotting later on. 
Overwriting the index is as easy as assigning to the index property 
of the DataFrame.

In [17]:
data.index = data.date_1
print(data.index)

DatetimeIndex(['2012-03-10', '2012-03-11', '2012-03-12', '2012-03-13',
               '2012-03-14', '2012-03-15', '2012-03-16', '2012-03-17',
               '2012-03-18', '2012-03-19',
               ...
               '2013-03-01', '2013-03-02', '2013-03-03', '2013-03-04',
               '2013-03-05', '2013-03-06', '2013-03-07', '2013-03-08',
               '2013-03-09', '2013-03-10'],
              dtype='datetime64[ns]', name='date_1', length=366, freq=None)


Accessing rows of the DataFrame by index

We can look up a row by its date (index label) with the `loc[]` property.

In [18]:
date = datetime(2012, 8, 19)
print("Weather on ", date)
print(data.loc[date])

Weather on  2012-08-19 00:00:00
date                         2012-8-19
max_temp                            82
mean_temp                           67
min_temp                            51
max_dew                             56
mean_dew                            50
min_dew                             42
max_humidity                        96
mean_humidity                       62
min_humidity                        28
max_pressure                     29.95
mean_pressure                    29.92
min_pressure                     29.89
max_visibilty                       10
mean_visibility                     10
min_visibility                      10
max_wind                            14
mean_wind                            3
min_wind                            21
precipitation                     0.00
cloud_cover                          1
events                             NaN
wind_dir                             1
date_1             2012-08-19 00:00:00
date_2             2012-08-19 00

We can even slice out a whole range of dates using a list-like slice syntax.

In [19]:
print("\nWeather for a range of dates")
print(data[datetime(2012, 4, 1):datetime(2012, 4, 7)])


Weather for a range of dates
                date  max_temp  mean_temp  min_temp  max_dew  mean_dew  \
date_1                                                                   
2012-04-01  2012-4-1        79         64        48       62        52   
2012-04-02  2012-4-2        68         62        55       60        51   
2012-04-03  2012-4-3        84         69        53       60        51   
2012-04-04  2012-4-4        68         62        55       59        54   
2012-04-05  2012-4-5        61         53        44       43        35   
2012-04-06  2012-4-6        61         48        35       35        31   
2012-04-07  2012-4-7        67         50        32       43        31   

            min_dew  max_humidity  mean_humidity  min_humidity    ...      \
date_1                                                            ...       
2012-04-01       44            96             75            54    ...       
2012-04-02       45            97             73            48    ...   
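
With a `DatetimeIndex`, `loc[]` also accepts plain date strings, which can be shorter than building `datetime` objects. The sketch below uses a hypothetical daily frame (`frame`, with an invented `day_number` column) in place of `data`.

```python
import pandas

# Hypothetical daily frame with a DatetimeIndex, standing in for `data`.
index = pandas.date_range("2012-03-10", "2012-06-10", freq="D")
frame = pandas.DataFrame({"day_number": range(len(index))}, index=index)

# Date strings are parsed for us; both endpoints are included.
first_week_of_april = frame.loc["2012-04-01":"2012-04-07"]

# A partial string such as "2012-05" selects every row in that month.
may = frame.loc["2012-05"]

print(len(first_week_of_april), len(may))
```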

With all of the dates in the index now, we no longer need the "date", "date_1" 
and "date_2" columns. Let's drop them.

In [20]:
data = data.drop(["date", "date_1", "date_2"], axis=1) 
print(data.columns)

Index(['max_temp', 'mean_temp', 'min_temp', 'max_dew', 'mean_dew', 'min_dew',
       'max_humidity', 'mean_humidity', 'min_humidity', 'max_pressure',
       'mean_pressure', 'min_pressure', 'max_visibilty', 'mean_visibility',
       'min_visibility', 'max_wind', 'mean_wind', 'min_wind', 'precipitation',
       'cloud_cover', 'events', 'wind_dir'],
      dtype='object')


### Exercise 4

Print out the cloud cover for each day in May.

**Hint:** you can make datetime objects with the `datetime(year, month, day)` 
function

In [21]:
print(datetime(2012, 5, 1))  # May 1st of 2012

data[datetime(2012, 5, 1):datetime(2012, 5, 31)].cloud_cover

2012-05-01 00:00:00


date_1
2012-05-01    6
2012-05-02    1
2012-05-03    0
2012-05-04    6
2012-05-05    3
2012-05-06    0
2012-05-07    5
2012-05-08    4
2012-05-09    3
2012-05-10    1
2012-05-11    0
2012-05-12    1
2012-05-13    4
2012-05-14    4
2012-05-15    0
2012-05-16    0
2012-05-17    0
2012-05-18    0
2012-05-19    0
2012-05-20    1
2012-05-21    4
2012-05-22    2
2012-05-23    0
2012-05-24    0
2012-05-25    2
2012-05-26    0
2012-05-27    0
2012-05-28    0
2012-05-29    4
2012-05-30    1
2012-05-31    4
Name: cloud_cover, dtype: int64

## Handling missing values

Pandas considers values like `NaN` and `None` to represent missing data. 
The `pandas.isnull()` function can be used to tell whether or not a value is 
missing.

We can use apply() across all of the columns in our DataFrame to figure out 
which values are missing.

In [22]:
empty = data.apply(lambda col: pandas.isnull(col))
empty.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 366 entries, 2012-03-10 to 2013-03-10
Freq: D
Data columns (total 22 columns):
max_temp           366 non-null bool
mean_temp          366 non-null bool
min_temp           366 non-null bool
max_dew            366 non-null bool
mean_dew           366 non-null bool
min_dew            366 non-null bool
max_humidity       366 non-null bool
mean_humidity      366 non-null bool
min_humidity       366 non-null bool
max_pressure       366 non-null bool
mean_pressure      366 non-null bool
min_pressure       366 non-null bool
max_visibilty      366 non-null bool
mean_visibility    366 non-null bool
min_visibility     366 non-null bool
max_wind           366 non-null bool
mean_wind          366 non-null bool
min_wind           366 non-null bool
precipitation      366 non-null bool
cloud_cover        366 non-null bool
events             366 non-null bool
wind_dir           366 non-null bool
dtypes: bool(22)
memory usage: 20.7 KB
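
The `info()` output above only confirms that every cell now holds a boolean flag; it does not say how many flags are `True`. A short sketch of counting the missing values per column, on a made-up two-column frame:

```python
import pandas

# Made-up rows; None marks a day without a recorded event.
sample = pandas.DataFrame({
    "max_temp": [56, 67, 71],
    "events": [None, "Rain", None],
})

# isnull() works on a whole DataFrame; summing the booleans counts the gaps.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```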


Wherever `pandas.isnull` returned `True`, the value is missing. Compare the flags with the actual values in the `events` column:

In [23]:
print(empty.events.head(10))
print(data.events.head(10))

date_1
2012-03-10     True
2012-03-11    False
2012-03-12    False
2012-03-13     True
2012-03-14     True
2012-03-15    False
2012-03-16     True
2012-03-17    False
2012-03-18    False
2012-03-19     True
Freq: D, Name: events, dtype: bool
date_1
2012-03-10                  NaN
2012-03-11                 Rain
2012-03-12                 Rain
2012-03-13                  NaN
2012-03-14                  NaN
2012-03-15    Rain-Thunderstorm
2012-03-16                  NaN
2012-03-17     Fog-Thunderstorm
2012-03-18                 Rain
2012-03-19                  NaN
Freq: D, Name: events, dtype: object


Filtering out rows with missing `events` values

In [24]:
data.dropna(subset=["events"]).info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 162 entries, 2012-03-11 to 2013-03-06
Data columns (total 22 columns):
max_temp           162 non-null int64
mean_temp          162 non-null int64
min_temp           162 non-null int64
max_dew            162 non-null int64
mean_dew           162 non-null int64
min_dew            162 non-null int64
max_humidity       162 non-null int64
mean_humidity      162 non-null int64
min_humidity       162 non-null int64
max_pressure       162 non-null float64
mean_pressure      162 non-null float64
min_pressure       162 non-null float64
max_visibilty      162 non-null int64
mean_visibility    162 non-null int64
min_visibility     162 non-null int64
max_wind           162 non-null int64
mean_wind          162 non-null int64
min_wind           162 non-null float64
precipitation      162 non-null object
cloud_cover        162 non-null int64
events             162 non-null object
wind_dir           162 non-null int64
dtypes: float64(4), int64(16),

In [25]:
data.dropna(subset=["events"]).head(10)

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,10,10,10,22,7,32.0,T,1,Rain,163
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,10,10,6,24,14,36.0,0.03,6,Rain,190
2012-03-15,79,69,58,61,58,53,90,69,48,30.13,...,10,10,10,31,10,41.0,0.04,3,Rain-Thunderstorm,209
2012-03-17,78,62,46,60,54,46,100,78,56,30.15,...,10,5,0,12,5,17.0,T,3,Fog-Thunderstorm,162
2012-03-18,80,70,59,61,58,57,93,69,45,30.14,...,10,10,9,18,8,25.0,T,2,Rain,197
2012-03-22,81,69,57,63,57,51,87,65,42,30.11,...,10,10,2,31,4,41.0,0.14,3,Rain,159
2012-03-23,73,64,55,61,58,54,97,79,61,30.03,...,10,9,2,21,6,24.0,0.86,7,Rain-Thunderstorm,129
2012-03-24,65,56,46,54,49,43,100,80,48,29.88,...,10,8,0,12,5,14.0,0.06,5,Fog-Rain,222
2012-03-28,77,64,51,56,46,33,73,47,21,29.97,...,10,10,10,20,9,25.0,T,2,Thunderstorm,269
2012-03-29,69,58,46,45,39,35,76,55,34,30.08,...,10,10,10,14,6,17.0,T,2,Rain,84


Fill missing values with empty strings.
This can be done with the `fillna()` method. 

In [26]:
data.events = data.events.fillna("")
data.events.head(10)

date_1
2012-03-10                     
2012-03-11                 Rain
2012-03-12                 Rain
2012-03-13                     
2012-03-14                     
2012-03-15    Rain-Thunderstorm
2012-03-16                     
2012-03-17     Fog-Thunderstorm
2012-03-18                 Rain
2012-03-19                     
Freq: D, Name: events, dtype: object

Accessing individual rows

Sometimes you need to access individual rows in your DataFrame. 
 * the `iloc[]` indexer lets you access the ith row of a DataFrame by position 
(starting from 0).

In [27]:
data.iloc[0]

max_temp              56
mean_temp             40
min_temp              24
max_dew               24
mean_dew              20
min_dew               16
max_humidity          74
mean_humidity         50
min_humidity          26
max_pressure       30.53
mean_pressure      30.45
min_pressure       30.34
max_visibilty         10
mean_visibility       10
min_visibility        10
max_wind              13
mean_wind              6
min_wind              17
precipitation       0.00
cloud_cover            0
events                  
wind_dir             138
Name: 2012-03-10 00:00:00, dtype: object

 * the `loc[]` indexer lets you access rows by their labels in the index.

In [28]:
data.loc[datetime(2013, 1, 1)]

max_temp              32
mean_temp             26
min_temp              20
max_dew               31
mean_dew              25
min_dew               16
max_humidity          92
mean_humidity         83
min_humidity          74
max_pressure        30.2
mean_pressure      30.11
min_pressure       30.04
max_visibilty          9
mean_visibility        5
min_visibility         2
max_wind              14
mean_wind              5
min_wind              15
precipitation          T
cloud_cover            8
events                  
wind_dir             353
Name: 2013-01-01 00:00:00, dtype: object
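
The difference between the two indexers is easiest to see side by side. In this sketch on a made-up three-row frame, the same row is reached once by position and once by label.

```python
import pandas
from datetime import datetime

# Made-up frame with a date index, standing in for `data`.
index = pandas.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"])
sample = pandas.DataFrame({"max_temp": [32, 35, 30]}, index=index)

by_position = sample.iloc[1]                 # second row, whatever its label
by_label = sample.loc[datetime(2013, 1, 2)]  # row labelled 2013-01-02

print(by_position.equals(by_label))
```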

## Iterating over all rows of DataFrame

You can iterate over each row in the DataFrame with `iterrows()`. 
Note that this function returns both the index and the row. Also, you must 
access columns in the row you get back from `iterrows()` with the dictionary 
syntax.

In [29]:
num_rain = 0
for idx, row in data.iterrows():
    if "Rain" in row["events"]:
        num_rain += 1

print("Days with rain:", num_rain)

Days with rain: 121
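
Row-by-row loops are easy to read but tend to be slow on large frames. The same count can be sketched with the vectorized string method `str.contains()`; the event strings below are made up for illustration.

```python
import pandas

# Made-up event strings; "" means no event, as after the fillna("") step.
events = pandas.Series(["", "Rain", "Rain-Thunderstorm", "Fog", "Fog-Rain-Snow"])

# str.contains() tests every string at once; True counts as 1 when summed.
num_rain = events.str.contains("Rain").sum()
print("Days with rain:", num_rain)
```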


### Exercise 5

Was there any November rain?

**Hint:** check out the `strftime()` function on datetime objects and the 
[documentation](https://docs.python.org/3.6/library/datetime.html#strftime-and-strptime-behavior).

In [30]:
d = datetime(2012, 1, 1)
d.strftime("%B")

november_rain = False
for date_idx, row in data.iterrows():
    if date_idx.strftime("%B") == "November" and "Rain" in row["events"]:
        november_rain = True

if november_rain:
    print("There was rain in November")
else:
    print("There was *not* rain in November")

There was rain in November
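
A loop-free variant is also possible, since a `DatetimeIndex` exposes the month number directly through its `month` attribute. A sketch on a made-up four-row frame:

```python
import pandas

# Made-up frame with a date index and an events column.
index = pandas.to_datetime(
    ["2012-10-31", "2012-11-05", "2012-11-12", "2012-12-01"]
)
sample = pandas.DataFrame({"events": ["Rain", "", "Fog-Rain", "Snow"]}, index=index)

is_november = sample.index.month == 11         # True for November dates
has_rain = sample.events.str.contains("Rain")  # True where it rained

# any() answers "is at least one value True?"
november_rain = bool((has_rain & is_november).any())
print(november_rain)
```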


## Filtering

Pandas is often used for selecting rows of interest from a DataFrame.

Get all days with max temp <= 32

In [31]:
freezing_days = data[data.max_temp <= 32]
freezing_days.head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2012-12-30,31,18,4,21,12,1,92,75,58,30.47,...,10,6,0,15,6,21.0,0.00,1,Fog,220
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353


Filter the DataFrame further, keeping only rows with min temp >= 20

In [32]:
freezing_days[freezing_days.min_temp >= 20].head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353
2013-01-25,30,25,20,18,12,0,74,57,39,30.35,...,10,8,1,16,7,21.0,0.02,6,Snow,192


Or do the same in one step by combining both conditions with a boolean operation

In [33]:
data[(data.max_temp <= 32) & (data.min_temp >= 20)].head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353
2013-01-25,30,25,20,18,12,0,74,57,39,30.35,...,10,8,1,16,7,21.0,0.02,6,Snow,192
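
Note that each condition needs its own parentheses, because `&` binds more tightly than `<=` and `>=`. For readers who prefer a string expression, `DataFrame.query()` is another option. The sketch below uses a made-up frame and checks that both forms select the same rows.

```python
import pandas

# Made-up temperatures standing in for the weather data.
sample = pandas.DataFrame({
    "max_temp": [31, 29, 31, 56],
    "min_temp": [21, 22, 4, 24],
})

# Boolean mask form, with parentheses around each comparison.
with_mask = sample[(sample.max_temp <= 32) & (sample.min_temp >= 20)]

# Equivalent string expression; column names are used directly.
with_query = sample.query("max_temp <= 32 and min_temp >= 20")

print(with_query.equals(with_mask))
```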


### How filtering works in pandas

It's important to understand what's really going on underneath with filtering. 
Let's look at what kind of object we actually get back when creating a filter.

A filtering condition creates a pandas `Series` object. 
Because our DataFrame uses datetime objects for the index, 
the result is a `Series` indexed by date, with a boolean value for every item 
in the index.

When the filter is applied, pandas lines up the rows of the `DataFrame` and the 
filter using the index, and then keeps the rows with a `True` filter value.
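
To make the alignment concrete, here is a minimal sketch on a made-up three-row frame (names and values are illustrative only). It shows that the mask shares the frame's index and that filtering keeps exactly the labels whose mask value is `True`.

```python
import pandas

# Made-up frame with a date index.
index = pandas.to_datetime(["2012-12-29", "2012-12-30", "2012-12-31"])
frame = pandas.DataFrame({"max_temp": [32, 31, 40]}, index=index)

mask = frame.max_temp <= 32  # boolean Series with the same index as frame
print(mask.index.equals(frame.index))

# Labels whose mask value is True ...
kept_labels = mask[mask].index
# ... are exactly the rows that boolean filtering returns.
print(frame[mask].equals(frame.loc[kept_labels]))
```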

Filtering condition

In [34]:
temp_max = data.max_temp <= 32

Filtering condition's type:

In [35]:
type(temp_max)

pandas.core.series.Series

Filtering condition:

In [36]:
temp_max.head()

date_1
2012-03-10    False
2012-03-11    False
2012-03-12    False
2012-03-13    False
2012-03-14    False
Freq: D, Name: max_temp, dtype: bool

Apply filter

In [37]:
data[temp_max].head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2012-12-30,31,18,4,21,12,1,92,75,58,30.47,...,10,6,0,15,6,21.0,0.00,1,Fog,220
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353


More filtering

In [38]:
temp_min = data.min_temp >= 20

Filtering condition:

In [39]:
temp_min.head()

date_1
2012-03-10    True
2012-03-11    True
2012-03-12    True
2012-03-13    True
2012-03-14    True
Freq: D, Name: min_temp, dtype: bool

Boolean operations on filtering conditions

In [40]:
(temp_min & temp_max).head()

date_1
2012-03-10    False
2012-03-11    False
2012-03-12    False
2012-03-13    False
2012-03-14    False
Freq: D, dtype: bool

Boolean operations on filter: and (&)

In [41]:
data[temp_min & temp_max].head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353
2013-01-25,30,25,20,18,12,0,74,57,39,30.35,...,10,8,1,16,7,21.0,0.02,6,Snow,192


Boolean operations on filter: or (|)

In [42]:
data[temp_min | temp_max].head()

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
2012-03-10,56,40,24,24,20,16,74,50,26,30.53,...,10,10,10,13,6,17.0,0.00,0,,138
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,10,10,10,22,7,32.0,T,1,Rain,163
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,10,10,6,24,14,36.0,0.03,6,Rain,190
2012-03-13,76,63,50,57,53,47,93,66,38,30.12,...,10,10,4,16,5,24.0,0.00,0,,242
2012-03-14,80,62,44,58,52,43,93,68,42,30.15,...,10,10,10,16,6,22.0,0.00,0,,202


### Exercise 6

What was the coldest it ever got when there was no cloud cover and no 
precipitation?

In [43]:
# Some rows contain 'T' for precipitation, which stands for 
# "trace amount of precipitation". First convert 'T' to a small floating 
# point number
def precipitation_to_float(precip_str):
    if precip_str == "T":
        return 1e-10  # Very small value
    return float(precip_str)

data.precipitation = data.precipitation.apply(precipitation_to_float)

# Now find the coldest it ever got when there was no cloud cover and no 
# precipitation
no_cloud_cover = (data.cloud_cover == 0)
no_precipitation = (data.precipitation == 0)
coldest_temp = data[no_cloud_cover & no_precipitation].min_temp.min()
print("Coldest without cloud cover and precipitation was", coldest_temp, 
      "degree(s) F")

Coldest without cloud cover and precipitation was 1 degree(s) F


In [44]:
coldest_temp = data[(data.cloud_cover == 0) & 
                    (data.precipitation == 0)].min_temp.min()
coldest_temp

1
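
The `apply()` step that converts 'T' could also be written with `pandas.to_numeric`. This is a sketch on made-up values and assumes 'T' is the only non-numeric entry, since `errors="coerce"` would silently turn any other malformed value into `NaN` as well.

```python
import pandas

# Made-up precipitation strings, including a trace marker.
precipitation = pandas.Series(["0.00", "T", "0.03"])

# Unparseable entries such as "T" become NaN ...
as_numbers = pandas.to_numeric(precipitation, errors="coerce")
# ... and are then replaced by the same tiny stand-in value used above.
as_numbers = as_numbers.fillna(1e-10)

print(as_numbers)
```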

## Grouping

Besides `apply()`, another great DataFrame method is `groupby()`. 
It will group a DataFrame by one or more columns, and let you iterate 
through each group.

As an example, let's group our DataFrame by the "cloud_cover" column 
(a value ranging from 0 to 8).

In [45]:
cover_temps = {}
for cover, cover_data in data.groupby("cloud_cover"):
    cover_temps[cover] = cover_data.mean_temp.mean()  # The mean mean temp!
cover_temps

{0: 59.73076923076923,
 1: 61.41509433962264,
 2: 59.72727272727273,
 3: 58.0625,
 4: 51.5,
 5: 50.827586206896555,
 6: 57.72727272727273,
 7: 46.5,
 8: 40.90909090909091}

Alternatively, we can use groupby aggregation

In [46]:
data.groupby(["cloud_cover"]).mean_temp.mean()

cloud_cover
0    59.730769
1    61.415094
2    59.727273
3    58.062500
4    51.500000
5    50.827586
6    57.727273
7    46.500000
8    40.909091
Name: mean_temp, dtype: float64
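
Groupby aggregation is not limited to a single statistic. As a sketch on made-up numbers, `agg()` can compute several per group at once:

```python
import pandas

# Made-up rows: two clear days and two fully overcast days.
sample = pandas.DataFrame({
    "cloud_cover": [0, 0, 8, 8],
    "mean_temp": [60, 70, 40, 44],
})

# One row per cloud_cover value, one column per statistic.
stats = sample.groupby("cloud_cover").mean_temp.agg(["mean", "min", "max"])
print(stats)
```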

Iterating through the result of `groupby()` means iterating over tuples. 
The first item is the column value, and the second item is a filtered 
`DataFrame` (where the column equals the first tuple value).

You can group by more than one column as well. 
In this case, the first tuple item returned by `groupby()` will itself 
be a tuple with the value of each column.

In [47]:
for (cover, events), group_data in data.groupby(["cloud_cover", "events"]):
    print("Cover: {0}, Events: {1}, Count: {2}" \
          .format(cover, events, len(group_data)))

Cover: 0, Events: , Count: 99
Cover: 0, Events: Fog, Count: 2
Cover: 0, Events: Rain, Count: 2
Cover: 0, Events: Thunderstorm, Count: 1
Cover: 1, Events: , Count: 35
Cover: 1, Events: Fog, Count: 5
Cover: 1, Events: Fog-Rain, Count: 1
Cover: 1, Events: Rain, Count: 4
Cover: 1, Events: Rain-Thunderstorm, Count: 2
Cover: 1, Events: Thunderstorm, Count: 6
Cover: 2, Events: , Count: 20
Cover: 2, Events: Fog, Count: 1
Cover: 2, Events: Rain, Count: 5
Cover: 2, Events: Rain-Thunderstorm, Count: 4
Cover: 2, Events: Snow, Count: 1
Cover: 2, Events: Thunderstorm, Count: 2
Cover: 3, Events: , Count: 12
Cover: 3, Events: Fog, Count: 2
Cover: 3, Events: Fog-Rain-Thunderstorm, Count: 3
Cover: 3, Events: Fog-Thunderstorm, Count: 1
Cover: 3, Events: Rain, Count: 9
Cover: 3, Events: Rain-Thunderstorm, Count: 4
Cover: 3, Events: Snow, Count: 1
Cover: 4, Events: , Count: 16
Cover: 4, Events: Fog, Count: 3
Cover: 4, Events: Fog-Rain, Count: 2
Cover: 4, Events: Fog-Rain-Thunderstorm, Count: 2
Cover: 4, Ev

### Exercise 7

Count the number of elements in each `["cloud_cover", "events"]` group by using 
`groupby` aggregation.

In [48]:
data.groupby(["cloud_cover", "events"]).size()

cloud_cover  events                    
0                                          99
             Fog                            2
             Rain                           2
             Thunderstorm                   1
1                                          35
             Fog                            5
             Fog-Rain                       1
             Rain                           4
             Rain-Thunderstorm              2
             Thunderstorm                   6
2                                          20
             Fog                            1
             Rain                           5
             Rain-Thunderstorm              4
             Snow                           1
             Thunderstorm                   2
3                                          12
             Fog                            2
             Fog-Rain-Thunderstorm          3
             Fog-Thunderstorm               1
             Rain                       

## Creating new columns

Weather events in our DataFrame are stored in strings like "Rain-Thunderstorm" 
to represent that it rained and there was a thunderstorm that day. Let's split 
them out into boolean "rain", "thunderstorm", etc. columns.

First, let's discover the different kinds of weather events we have with 
`unique()`.

In [49]:
data.events.unique()
# If the whole array is not printed out, use 
# [x for x in data.events.unique()]

array(['', 'Rain', 'Rain-Thunderstorm', 'Fog-Thunderstorm', 'Fog-Rain',
       'Thunderstorm', 'Fog-Rain-Thunderstorm', 'Fog', 'Fog-Rain-Snow',
       'Fog-Rain-Snow-Thunderstorm', 'Fog-Snow', 'Snow', 'Rain-Snow'],
      dtype=object)

Let's create new columns for each of "Rain", "Thunderstorm", "Fog", and "Snow" 
events.

In [50]:
for event_kind in ["Rain", "Thunderstorm", "Fog", "Snow"]:
    col_name = event_kind.lower()  # Turn "Rain" into "rain", etc.
    data[col_name] = data.events.apply(lambda e: event_kind in e)
data.head(5)

date_1,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir,rain,thunderstorm,fog,snow
2012-03-10,56,40,24,24,20,16,74,50,26,30.53,...,6,17.0,0.0,0,,138,False,False,False,False
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,7,32.0,1e-10,1,Rain,163,True,False,False,False
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,14,36.0,0.03,6,Rain,190,True,False,False,False
2012-03-13,76,63,50,57,53,47,93,66,38,30.12,...,5,24.0,0.0,0,,242,False,False,False,False
2012-03-14,80,62,44,58,52,43,93,68,42,30.15,...,6,22.0,0.0,0,,202,False,False,False,False
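
The loop above names the event kinds by hand. As an alternative sketch, `str.get_dummies()` can discover them from the data by splitting each string on a separator. The three event strings here are made up for illustration.

```python
import pandas

# Made-up event strings in the same "A-B-C" style as the dataset.
events = pandas.Series(["", "Rain", "Fog-Rain-Snow"])

# Split on "-" and build one indicator column per event kind,
# then convert the 0/1 indicators to booleans.
flags = events.str.get_dummies(sep="-").astype(bool)
flags.columns = flags.columns.str.lower()

print(flags)
```

The resulting columns could then be joined onto the original frame, for example with `DataFrame.join()`.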


## Counting number of days with rain and with rain and snow

Days with rain: 

In [51]:
data.rain.sum()

121

Days with both rain and snow:

In [54]:
(data.rain & data.snow).sum()  # count the days where both flags are True

7

### Exercise 8

Was the mean temperature more variable on days with rain and snow than on days 
with just rain or just snow?

**Hint:** don't forget the `std()` function, which returns the standard deviation of a set of values, a measure of variability.

In [55]:
days_with_rain = data[data.rain]  # the column is already boolean
days_with_snow = data[data.snow]

rain_std = days_with_rain.mean_temp.std()
snow_std = days_with_snow.mean_temp.std()

if rain_std > snow_std:
    print("Rainy days were more variable")
elif snow_std > rain_std:
    print("Snowy days were more variable")
else:
    print("They were the same")

Rainy days were more variable
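
The solution above compares all rainy days with all snowy days. A reading that follows the question's wording more closely would split the days into three disjoint groups. The sketch below does that on a made-up frame, using `~` to negate a boolean column.

```python
import pandas

# Made-up days; rain and snow are boolean flags as created earlier.
sample = pandas.DataFrame({
    "mean_temp": [40, 50, 60, 30, 34, 20, 28, 33, 35],
    "rain": [True, True, True, True, True, False, False, False, False],
    "snow": [False, False, False, True, True, True, True, False, False],
})

both = sample[sample.rain & sample.snow]
only_rain = sample[sample.rain & ~sample.snow]
only_snow = sample[~sample.rain & sample.snow]

groups = [("rain and snow", both), ("only rain", only_rain), ("only snow", only_snow)]
for label, subset in groups:
    print(label, subset.mean_temp.std())
```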
