# Preparing data

## Reading multiple data files

- pd.read_csv()
- pd.read_excel()
- pd.read_html()
- pd.read_json()

***How to read multiple dataframes?***
- `dfs = [pd.read_csv(f) for f in filenames]`
- when many filenames follow a similar pattern, we can use the glob module
  - `from glob import glob`
  - `filenames = glob('sales*.csv')`
  - the `*` is a wildcard that matches zero or more characters
  - this returns a list of matching filenames (in no guaranteed order)
  - then we can do `[pd.read_csv(f) for f in filenames]`
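
A minimal sketch of the glob pattern above. The file names and contents are made up for illustration, and the files are written to a temporary directory so the example runs on its own.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Create a throwaway directory with three small CSV files (hypothetical data).
tmp_dir = Path(tempfile.mkdtemp())
for month, units in [("jan", 10), ("feb", 20), ("mar", 30)]:
    (tmp_dir / f"sales-{month}.csv").write_text(f"month,units\n{month},{units}\n")

# glob returns a plain list of matching paths; sort it for a stable order.
filenames = sorted(glob(str(tmp_dir / "sales*.csv")))

# Read each file into its own DataFrame.
dfs = [pd.read_csv(f) for f in filenames]

print(len(dfs))
print(dfs[0])
```

Sorting the result of `glob` is optional, but it keeps the list of DataFrames in a predictable order.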

## Reindexing DataFrames
Both plurals, indices and indexes, are correct!

Let's adopt the following convention:
- indices: many index labels within one Index data structure
- indexes: many pandas Index data structures

- `df.reindex(list_of_ordered_labels)` to impose a given row order
- `df.sort_index()` if you want to sort
- `df.reindex(df2.index)` to reuse the index of another DataFrame
- you might want to follow with `df.dropna()`, because reindexing adds NaN rows for labels the DataFrame did not have before
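
The calls listed above can be tried on a tiny example. This is only a sketch: the DataFrames `df` and `df2` and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical quarterly data, deliberately out of order.
df = pd.DataFrame({"temp": [61.9, 32.1, 68.9]}, index=["Apr", "Jan", "Jul"])
df2 = pd.DataFrame({"rain": [1.0, 2.0]}, index=["Jan", "Feb"])

ordered = df.reindex(["Jan", "Apr", "Jul"])  # reorder existing labels
sorted_df = df.sort_index()                  # alphabetical order instead
aligned = df.reindex(df2.index)              # 'Feb' is new, so its row is NaN
cleaned = aligned.dropna()                   # keep only labels df already had

print(aligned)
print(cleaned)
```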

### Sorting DataFrame with the Index & columns
It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are .sort_index() and .sort_values().

- Read 'monthly_max_temp.csv' into a DataFrame called weather1 with 'Month' as the index.
- Sort the index of weather1 in alphabetical order using the .sort_index() method and store the result in weather2.
- Sort the index of weather1 in reverse alphabetical order by specifying the additional keyword argument ascending=False inside .sort_index().
- Use the .sort_values() method to sort weather1 in increasing numerical order according to the values of the column 'Max TemperatureF'.

In [0]:
# Import pandas
import pandas as pd

# Read 'monthly_max_temp.csv' into a DataFrame: weather1
weather1 = pd.read_csv('monthly_max_temp.csv',index_col='Month')

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')

# Print the head of weather4
print(weather4.head())

### Reindexing DataFrame from a list
Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex() method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called weather1 and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

```
       Mean TemperatureF
Month                   
Apr            61.956044
Jan            32.133333
Jul            68.934783
Oct            43.434783
```

You'll initially use a list of all twelve month abbreviations and subsequently apply the .ffill() method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year.

```
['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']
```

- Reorder the rows of weather1 using the .reindex() method with the list year as the argument, which contains the abbreviations for each month.

- Reorder the rows of weather1 just as you did above, this time chaining the .ffill() method to replace the null values with the last preceding non-null value.

In [0]:
# Import pandas
import pandas as pd

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)

Output:
```
           Mean TemperatureF
    Month                   
    Jan            32.133333
    Feb                  NaN
    Mar                  NaN
    Apr            61.956044
    May                  NaN
    Jun                  NaN
    Jul            68.934783
    Aug                  NaN
    Sep                  NaN
    Oct            43.434783
    Nov                  NaN
    Dec                  NaN
           Mean TemperatureF
    Month                   
    Jan            32.133333
    Feb            32.133333
    Mar            32.133333
    Apr            61.956044
    May            61.956044
    Jun            61.956044
    Jul            68.934783
    Aug            68.934783
    Sep            68.934783
    Oct            43.434783
    Nov            43.434783
    Dec            43.434783
```

### Reindexing using another DataFrame Index
Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.

The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.

The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:
```
names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
```
As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.
```
Shape of names_1981 DataFrame: (19455, 1)
Shape of names_1881 DataFrame: (1935, 1)

```

Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.

- Create a new DataFrame common_names by reindexing names_1981 using the index attribute of the DataFrame names_1881 of older names.
- Print the shape of the new common_names DataFrame. This has been done for you. It should be the same as that of names_1881.
- Drop the rows of common_names that have null counts using the .dropna() method. These rows correspond to names that fell out of fashion between 1881 & 1981.
- Print the shape of the reassigned common_names DataFrame.

In [0]:
# Import pandas
import pandas as pd

# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
print(common_names.shape)

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)

## Arithmetic with Series and DataFrames

- broadcasting of a scalar across a whole DataFrame is possible
- dividing each column of a DataFrame by a Series: `df1.divide(series1, axis='rows')`
- percentage change along a time series: `df.pct_change() * 100`, where `pct_change()` computes (curr - prev) / prev
- `df1 + df2` for DataFrames with matching column names; labels present in only one of them give NaN
- same thing with `df1.add(df2)`
- `df1.add(df2, fill_value=0)` treats a value missing on one side as 0
- triple sum: `df1.add(df2, fill_value=0).add(df3, fill_value=0)`
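
A short sketch of these operations on invented data. The names `df1`, `df2` and `series1` mirror the bullets above, but the values are assumptions. The sketch spells the axis as `'index'`, the documented name for the row axis that these notes write as `'rows'`.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [10.0, 20.0, 40.0], "b": [1.0, 2.0, 4.0]}, index=["x", "y", "z"]
)
series1 = pd.Series([10.0, 20.0, 40.0], index=["x", "y", "z"])

# Divide every column by the Series, matching on the row labels.
ratio = df1.divide(series1, axis="index")

# Percentage change from the previous row; the first row has no predecessor.
growth = df1.pct_change() * 100

# Label-aligned addition: 'w' exists only in df2, 'y' and 'z' only in df1.
df2 = pd.DataFrame({"a": [1.0, 1.0], "b": [1.0, 1.0]}, index=["x", "w"])
plain = df1 + df2                    # unmatched labels become NaN
filled = df1.add(df2, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```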

In [0]:
'''
Converting temperature units and renaming the column
'''
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f-32)*5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F','C')

# Print first 5 rows of temps_c
print(temps_c.head())

### Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.

The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from pandas Foundations.

- Read the file 'GDP.csv' into a DataFrame called gdp, using parse_dates=True and index_col='DATE'.
- Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
- Print the last 8 rows of the slice post2008. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.
- Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you need to chain .resample() (using the alias 'A' for annual frequency) with some kind of aggregation; you will use the aggregation method .last() to select the last element when resampling.
- Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.

In [0]:
import pandas as pd

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv',parse_dates=True , index_col='DATE')

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008':]

# Print the last 8 rows of post2008
print(post2008.tail(8))

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100

# Print yearly again
print(yearly)

### Converting currency of stocks
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv for sp500 and exchange.csv for the exchange rates are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

- Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively.
- Use parse_dates=True and index_col='Date'.
- Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and print the first 5 rows.
- Construct a new DataFrame pounds by converting US dollars to British pounds. You'll use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'
- Print the first 5 rows of the new DataFrame pounds.

In [0]:
# Import pandas
import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('sp500.csv',parse_dates=True, index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('exchange.csv',parse_dates=True, index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open','Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'],axis='rows')

# Print the head of pounds
print(pounds.head())

# Concatenating data

## Appending and concatenating Series

**Stacking on top of one another**
- `s1.append(s2)` -- only row-wise -- does not change the index labels (note: `.append()` was deprecated in pandas 1.4 and removed in 2.0; use `pd.concat` there)
- we can reset the index with `s1.append(s2).reset_index(drop=True)`
- append multiple DataFrames/Series by chaining: `s1.append(s2).append(s3)`
- `pd.concat([s1,s2,s3])` -- more flexible, can concatenate vertically or horizontally
- `pd.concat([s1,s2,s3], ignore_index=True)`
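
A small sketch of stacking Series with `pd.concat`, which also works on pandas versions that no longer have `.append()`. The Series `s1`, `s2` and `s3` are invented for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "b"])
s3 = pd.Series([5], index=["c"])

# Original labels are kept, so 'a' and 'b' appear twice.
stacked = pd.concat([s1, s2, s3])

# Two ways to get a fresh 0..n-1 index.
renumbered = pd.concat([s1, s2, s3], ignore_index=True)
same = pd.concat([s1, s2, s3]).reset_index(drop=True)

# axis=1 aligns on labels and gives one column per Series.
side_by_side = pd.concat([s1, s2], axis=1)

print(stacked)
print(side_by_side)
```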

## Appending and concatenating DataFrames

**concat**
- if they have the same index and column names, they will stack up as expected
- with different indexes and columns, the result contains every column and every row, with NaNs where data is missing; in other words a **UNION** occurs
  - even rows with the same label are repeated
- by default concat uses axis=0, i.e. stacking row-wise
- with axis=1 or axis='columns', rows with the same index label get aligned and the columns are concatenated; NaNs are put in the relevant places
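
To make the union behaviour concrete, here is a sketch with two small DataFrames that overlap only partly. The names and values are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=["r2", "r3"])

# axis=0: rows are stacked, columns are the union A, B, C; 'r2' appears twice.
rows = pd.concat([df1, df2])

# axis=1: row labels are aligned (r1, r2, r3), columns are placed side by side.
cols = pd.concat([df1, df2], axis=1)

print(rows)
print(cols)
```

Note that the column-wise result has two columns named `B`, one from each input.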

In [0]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = pd.concat([names_1881,names_1981],ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
print(combined_names[combined_names['name']=='Morgan'])

Output 

```
    (19455, 4)
    (1935, 4)
    (21390, 4)
             name gender  count  year
    1283   Morgan      M     23  1881
    2096   Morgan      F   1769  1981
    14390  Morgan      M    766  1981
```

In [0]:
# Initialize an empty list: medals
medals = []

for medal in medal_types:
    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name,header=0,index_col='Country',names=columns)
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals_df
medals_df = pd.concat(medals,axis='columns')

# Print medals_df
print(medals_df)

## Concatenation, keys, & MultiIndexes

- two different DataFrames with the same index labels and the same column names will just stack vertically when concatenated. If we want to tell them apart, one thing we could do is:
  - `pd.concat([df1,df2], keys=[2013,2014])`
  - this creates a **multi-index**, with each DataFrame getting an outer index level of 2013 and 2014 respectively.
- another way is:
  - to concat with `axis=1`
  - unfortunately that may give two columns with the same name; we can create **multi-level columns** with `pd.concat([df1,df2], keys=[2013,2014], axis=1)`
- we can do all of this with a dict as well:
  - `dd = {2013: df1, 2014: df2}`
  - `df = pd.concat(dd, axis=0)` (or `axis=1`)
  - the dict keys are treated as the `keys` argument values.
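
The three variants can be compared side by side. In this sketch `df1` and `df2` are hypothetical rainfall tables for 2013 and 2014.

```python
import pandas as pd

df1 = pd.DataFrame({"rain": [1.0, 2.0]}, index=["Jan", "Feb"])
df2 = pd.DataFrame({"rain": [3.0, 4.0]}, index=["Jan", "Feb"])

# keys= adds an outer level to the row index ...
by_rows = pd.concat([df1, df2], keys=[2013, 2014])

# ... or, with axis=1, an outer level to the columns.
by_cols = pd.concat([df1, df2], keys=[2013, 2014], axis=1)

# Passing a dict uses its keys in place of the keys= argument.
from_dict = pd.concat({2013: df1, 2014: df2})

print(by_rows.loc[(2014, "Feb"), "rain"])
print(by_cols[(2013, "rain")])
```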

In [0]:
# Start from an empty list so DataFrames from an earlier cell are not reused: medals
medals = []

for medal in medal_types:

    file_name = "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name,index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals,keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)

output:
```
                            Total
           Country               
    bronze United States   1052.0
           Soviet Union     584.0
           United Kingdom   505.0
           France           475.0
           Germany          454.0
    silver United States   1195.0
           Soviet Union     627.0
           United Kingdom   591.0
           France           461.0
           Italy            394.0
    gold   United States   2088.0
           Soviet Union     838.0
           United Kingdom   498.0
           Italy            460.0
           Germany          407.0
```

In [0]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

# Print data about silver medals
print(medals_sorted.loc['silver'])

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'], :])

OUTPUT:
```
    Total    454.0
    Name: (bronze, Germany), dtype: float64
                     Total
    Country               
    France           461.0
    Italy            394.0
    Soviet Union     627.0
    United Kingdom   591.0
    United States   1195.0
                           Total
           Country              
    bronze United Kingdom  505.0
    gold   United Kingdom  498.0
    silver United Kingdom  591.0
```

In [0]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb. 2, 2015':'Feb. 8, 2015', idx[:, 'Company']]

# Print slice_2_8
print(slice_2_8)

Output:

```
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
    Data columns (total 9 columns):
    (Hardware, Company)    5 non-null object
    (Hardware, Product)    5 non-null object
    (Hardware, Units)      5 non-null float64
    (Software, Company)    9 non-null object
    (Software, Product)    9 non-null object
    (Software, Units)      9 non-null float64
    (Service, Company)     6 non-null object
    (Service, Product)     6 non-null object
    (Service, Units)       6 non-null float64
    dtypes: float64(3), object(6)
    memory usage: 1.6+ KB
    None
                                Hardware         Software Service
                                 Company          Company Company
    Date                                                         
    2015-02-02 08:33:01              NaN            Hooli     NaN
    2015-02-02 20:54:49        Mediacore              NaN     NaN
    2015-02-03 14:14:18              NaN          Initech     NaN
    2015-02-04 15:36:29              NaN        Streeplex     NaN
    2015-02-04 21:52:45  Acme Coporation              NaN     NaN
    2015-02-05 01:53:06              NaN  Acme Coporation     NaN
    2015-02-05 22:05:03              NaN              NaN   Hooli
    2015-02-07 22:58:10  Acme Coporation              NaN     NaN
```

In [0]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]
print(month_list)
# Create an empty dictionary: month_dict
month_dict = dict()

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])

OUTPUT:

```
    [('january',
                        Date          Company   Product  Units
    0   2015-01-21 19:13:21        Streeplex  Hardware     11
    1   2015-01-09 05:23:51        Streeplex   Service      8
    2   2015-01-06 17:19:34          Initech  Hardware     17
    3   2015-01-02 09:51:06            Hooli  Hardware     16
    4   2015-01-11 14:51:02            Hooli  Hardware     11
    5   2015-01-01 07:31:20  Acme Coporation  Software     18
    6   2015-01-24 08:01:16          Initech  Software      1
    7   2015-01-25 15:40:07          Initech   Service      6
    8   2015-01-13 05:36:12            Hooli   Service      7
    9   2015-01-03 18:00:19            Hooli   Service     19
    10  2015-01-16 00:33:47            Hooli  Hardware     17
    11  2015-01-16 07:21:12          Initech   Service     13
    12  2015-01-20 19:49:24  Acme Coporation  Hardware     12
    13  2015-01-26 01:50:25  Acme Coporation  Software     14
    14  2015-01-15 02:38:25  Acme Coporation   Service     16
    15  2015-01-06 13:47:37  Acme Coporation  Software     16
    16  2015-01-15 15:33:40        Mediacore  Hardware      7
    17  2015-01-27 07:11:55        Streeplex   Service     18
    18  2015-01-20 11:28:02        Streeplex  Software     13
    19  2015-01-16 19:20:46        Mediacore   Service      8), 
    ('february',
                        Date          Company   Product  Units
    0   2015-02-26 08:57:45        Streeplex   Service      4
    1   2015-02-16 12:09:19            Hooli  Software     10
    2   2015-02-03 14:14:18          Initech  Software     13
    3   2015-02-02 08:33:01            Hooli  Software      3
    4   2015-02-25 00:29:00          Initech   Service     10
    5   2015-02-05 01:53:06  Acme Coporation  Software     19
    6   2015-02-09 08:57:30        Streeplex   Service     19
    7   2015-02-11 20:03:08          Initech  Software      7
    8   2015-02-04 21:52:45  Acme Coporation  Hardware     14
    9   2015-02-09 13:09:55        Mediacore  Software      7
    10  2015-02-07 22:58:10  Acme Coporation  Hardware      1
    11  2015-02-11 22:50:44            Hooli  Software      4
    12  2015-02-26 08:58:51        Streeplex   Service      1
    13  2015-02-05 22:05:03            Hooli   Service     10
    14  2015-02-04 15:36:29        Streeplex  Software     13
    15  2015-02-19 16:02:58        Mediacore   Service     10
    16  2015-02-19 10:59:33        Mediacore  Hardware     16
    17  2015-02-02 20:54:49        Mediacore  Hardware      9
    18  2015-02-21 05:01:26        Mediacore  Software      3
    19  2015-02-21 20:41:47            Hooli  Hardware      3),
     ('march',
                         Date          Company   Product  Units
    0   2015-03-22 14:42:25        Mediacore  Software      6
    1   2015-03-12 18:33:06          Initech   Service     19
    2   2015-03-22 03:58:28        Streeplex  Software      8
    3   2015-03-15 00:53:12            Hooli  Hardware     19
    4   2015-03-17 19:25:37            Hooli  Hardware     10
    5   2015-03-16 05:54:06        Mediacore  Software      3
    6   2015-03-25 10:18:10          Initech  Hardware      9
    7   2015-03-25 16:42:42        Streeplex  Hardware     12
    8   2015-03-26 05:20:04        Streeplex  Software      3
    9   2015-03-06 10:11:45        Mediacore  Software     17
    10  2015-03-22 21:14:39          Initech  Hardware     11
    11  2015-03-17 19:38:12            Hooli  Hardware      8
    12  2015-03-28 19:20:38  Acme Coporation   Service      5
    13  2015-03-13 04:41:32        Streeplex  Hardware      8
    14  2015-03-06 02:03:56        Mediacore  Software     17
    15  2015-03-13 11:40:16          Initech  Software     11
    16  2015-03-27 08:29:45        Mediacore  Software      6
    17  2015-03-21 06:42:41        Mediacore  Hardware     19
    18  2015-03-15 08:50:45          Initech  Hardware     18
    19  2015-03-13 16:25:24        Streeplex  Software      9)]
                              
                              Units
             Company               
    february Acme Coporation     34
             Hooli               30
             Initech             30
             Mediacore           45
             Streeplex           37
    january  Acme Coporation     76
             Hooli               70
             Initech             37
             Mediacore           15
             Streeplex           50
    march    Acme Coporation      5
             Hooli               37
             Initech             68
             Mediacore           68
             Streeplex           40
                        Units
             Company         
    february Mediacore     45
    january  Mediacore     15
    march    Mediacore     68
```

## In numpy arrays:

- **horizontal** stacking: same # of rows, # of columns can differ
- np.hstack([A,B])
- np.concatenate([A,B], axis=1)
- **vertical** stacking: same # of columns, # of rows can differ
- np.vstack([A,B])
- np.concatenate([A,B], axis=0)
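
A quick check of the shape rules above, using small made-up arrays `A`, `B` and `C`.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)  # shape (2, 3)
B = np.arange(4).reshape(2, 2)  # shape (2, 2): same number of rows as A
C = np.arange(9).reshape(3, 3)  # shape (3, 3): same number of columns as A

h = np.hstack([A, B])  # columns are appended, result has shape (2, 5)
v = np.vstack([A, C])  # rows are appended, result has shape (5, 3)

print(h.shape, v.shape)
```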

## Joins:

- **Outer join** :
  - union of index sets (all labels, no reps)
  - missing values filled with NaN
  - pd.concat([df1,df2],axis=1,join='outer') #default is 'outer'
- **Inner join** :
  - intersection of index sets (only common labels)
  - pd.concat([df1,df2],axis=1,join='inner')
- inner and outer joins also work with axis=0, where `join` applies to the column labels instead
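
A sketch of both join types on two invented DataFrames whose indexes overlap on `'b'` and `'c'` only.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"y": [10, 20, 30]}, index=["b", "c", "d"])

# Outer join: every label from either index, NaN where a side has no data.
outer = pd.concat([df1, df2], axis=1, join="outer")

# Inner join: only the labels present in both indexes.
inner = pd.concat([df1, df2], axis=1, join="inner")

print(outer)
print(inner)
```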

# Merging data

## Merging DataFrames

- merge extends concat with the ability to **align rows using multiple columns**
- `pd.merge(df1,df2)`: merges on all columns that occur in *both* DataFrames; the values in those columns have to match too. By default this is an inner join
- `pd.merge(a,b,on='col')`: merges on the 'col' column.
- `pd.merge(a,b,on=['col1','col2'],suffixes=['_year1','_year2'])` to change the default suffixes `_x` and `_y`.
- what if the column names differ (but hold the same kind of values, so we want to merge on them)? `pd.merge(a,b,left_on='col1',right_on='col2')`
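
The following sketch walks through the merge variants listed above. The tables `a`, `b` and `c` and the column names `city`, `state`, `pop` and `town` are hypothetical.

```python
import pandas as pd

a = pd.DataFrame({"city": ["Austin", "Dallas"], "state": ["TX", "TX"], "pop": [1, 2]})
b = pd.DataFrame({"city": ["Austin", "Dallas"], "state": ["TX", "TX"], "pop": [3, 4]})

# Default: merge on every shared column. 'pop' never matches, so no rows survive.
default = pd.merge(a, b)

# Merge on chosen columns; the clashing 'pop' columns get custom suffixes.
on_two = pd.merge(a, b, on=["city", "state"], suffixes=("_year1", "_year2"))

# Same key under different names in the two tables.
c = b.rename(columns={"city": "town"})
diff_names = pd.merge(a, c, left_on="city", right_on="town")

print(len(default))
print(on_two)
print(diff_names.columns.tolist())
```

With `left_on`/`right_on`, both key columns are kept in the result, and the remaining overlapping columns fall back to the `_x`/`_y` suffixes.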

## Joining DataFrames:

- `pd.merge(a,b,left_on='col1',right_on='col2',how='inner')` #default is inner
- how = 'left'
- how = 'right'
- how = 'outer'

- we can also use `df1.join(df2, how='left')` #how is 'left' by default, and the join is on the index
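
To see how `how` changes the result, the sketch below counts the rows each join type produces. The tables `left` and `right` are made up and share the keys `'b'` and `'c'` only.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [20, 30, 40]})

# Number of rows produced by each join type.
sizes = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(sizes)

# .join() works on the index and defaults to a left join.
joined = left.set_index("key").join(right.set_index("key"))
print(joined)
```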

![What to use?](df.png)

## Ordered Merges:

- for indexes that have a natural ordering, e.g. date-time
- `pd.merge(df1, df2, how='outer').sort_values('Dates')` is roughly `pd.merge_ordered(df1, df2)`
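
A sketch comparing the two routes on invented weather tables. The column names `temp_aus` and `temp_hou` are assumptions, chosen to differ so that no suffixes are needed.

```python
import pandas as pd

austin = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-17", "2016-02-08"]),
    "temp_aus": [50.0, 55.0, 60.0],
})
houston = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-04", "2016-03-01"]),
    "temp_hou": [52.0, 54.0, 70.0],
})

# Ordered merge: outer join on 'date', returned in date order.
ordered = pd.merge_ordered(austin, houston, on="date")

# The manual route: outer merge, then sort by the key.
manual = (
    pd.merge(austin, houston, on="date", how="outer")
    .sort_values("date")
    .reset_index(drop=True)
)

# fill_method='ffill' carries the last known value forward into the gaps.
filled = pd.merge_ordered(austin, houston, on="date", fill_method="ffill")

print(ordered)
print(filled)
```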

In [0]:
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin,houston)

# Print tx_weather
print(tx_weather)

# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'])

# Print tx_weather_suff
print(tx_weather_suff)

# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'],fill_method='ffill')

# Print tx_weather_ffill
print(tx_weather_ffill)

### Using merge_asof()
Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only the last row from the right DataFrame whose 'on' column value is less than or equal to the left value will be kept (this is the default, direction='backward').

This function can be used to align disparate datetime frequencies without having to first resample.
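
A minimal sketch of this backward matching. The `prices` and `events` tables are invented; both are sorted by their key columns, which `pd.merge_asof()` requires.

```python
import pandas as pd

# Hypothetical monthly prices (right side).
prices = pd.DataFrame({
    "Date": pd.to_datetime(["1970-01-01", "1970-02-01", "1970-03-01"]),
    "Price": [3.35, 3.40, 3.45],
})

# Hypothetical irregular events (left side).
events = pd.DataFrame({
    "when": pd.to_datetime(["1970-01-01", "1970-01-20", "1970-02-15"]),
    "event": ["a", "b", "c"],
})

# Each event picks up the most recent price dated on or before it.
merged = pd.merge_asof(events, prices, left_on="when", right_on="Date")
print(merged)
```

Every left row is kept exactly once, so the result has as many rows as `events`.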

Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil and auto. The first 5 rows of each have been printed in the IPython Shell for you to explore.
```
oil
        Date  Price
0 1970-01-01   3.35
1 1970-02-01   3.35
2 1970-03-01   3.35
3 1970-04-01   3.35
4 1970-05-01   3.35

auto
    mpg  cyl  displ   hp  weight  accel         yr origin                       name
0  18.0    8  307.0  130    3504   12.0 1970-01-01     US  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5 1970-01-01     US          buick skylark 320
2  18.0    8  318.0  150    3436   11.0 1970-01-01     US         plymouth satellite
3  16.0    8  304.0  150    3433   12.0 1970-01-01     US              amc rebel sst
4  17.0    8  302.0  140    3449   10.5 1970-01-01     US                ford torino
```
These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.

You'll then inspect the merged DataFrame, resample by year and compute the mean 'Price' and 'mpg'. You should be able to see a trend in these two columns, that you can confirm by computing the Pearson correlation between resampled 'Price' and 'mpg'.

- Merge auto and oil using pd.merge_asof() with left_on='yr' and right_on='Date'. Store the result as merged.
- Print the tail of merged. This has been done for you.
- Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and aggregate the mean. Store the result as yearly.
- Examine the contents of yearly and yearly.corr(), which shows the Pearson correlation between the resampled 'Price' and 'mpg'.

In [0]:
# Merge auto and oil: merged
merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date')

# Print the tail of merged
print(merged.tail())

# Resample merged: yearly
yearly = merged.resample('A',on='Date')[['mpg','Price']].mean()

# Print yearly
print(yearly)

# print yearly.corr()
print(yearly.corr())

```
          mpg  cyl  displ  hp  weight  ...         yr  origin             name       Date  Price
    387  27.0    4  140.0  86    2790  ... 1982-01-01      US  ford mustang gl 1982-01-01  33.85
    388  44.0    4   97.0  52    2130  ... 1982-01-01  Europe        vw pickup 1982-01-01  33.85
    389  32.0    4  135.0  84    2295  ... 1982-01-01      US    dodge rampage 1982-01-01  33.85
    390  28.0    4  120.0  79    2625  ... 1982-01-01      US      ford ranger 1982-01-01  33.85
    391  31.0    4  119.0  82    2720  ... 1982-01-01      US       chevy s-10 1982-01-01  33.85
    
    [5 rows x 11 columns]
                      mpg  Price
    Date                        
    1970-12-31  17.689655   3.35
    1971-12-31  21.111111   3.56
    1972-12-31  18.714286   3.56
    1973-12-31  17.100000   3.56
    1974-12-31  22.769231  10.11
    1975-12-31  20.266667  11.16
    1976-12-31  21.573529  11.16
    1977-12-31  23.375000  13.90
    1978-12-31  24.061111  14.85
    1979-12-31  25.093103  14.85
    1980-12-31  33.803704  32.50
    1981-12-31  30.185714  38.00
    1982-12-31  32.000000  33.85
                mpg     Price
    mpg    1.000000  0.948677
    Price  0.948677  1.000000
```

The expanding mean is the mean of all the data available up to each point in time, computed down each column. If you are interested in learning more about pandas' expanding transformations, the windowing section of the pandas documentation has additional information.
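
As a small illustration of the expanding mean, here is a sketch on a made-up Series; each output value is the mean of all values up to and including that position.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Mean of s[:1], s[:2], s[:3], s[:4] in turn.
running = s.expanding().mean()

print(running.tolist())  # [2.0, 3.0, 4.0, 5.0]
```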