# Using merge_asof() to create a dataset

The merge_asof() function can be used to create datasets where you have a table of start and stop dates and want to use them to create a flag in another table. You have been given gdp, a table of quarterly US GDP values during the 1980s, and recession, a table that holds the starting date of every US recession since 1980 and the date when each recession was declared to be over. Use merge_asof() to merge the tables and create a status flag that marks whether a quarter fell during a recession. Finally, to check your work, plot the data in a bar chart.


* Using merge_asof(), merge gdp and recession on date, with gdp as the left table. Save to the variable gdp_recession.
* Create a list named is_recession using a list comprehension and a conditional expression, where each entry is 'r' if the row's gdp_recession['econ_status'] value equals 'recession' and 'g' otherwise.
* Using gdp_recession, plot a bar chart of gdp versus date, setting the color argument equal to is_recession.
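
The steps above can be sketched on a tiny example. This is a minimal illustration and not the exercise data: the tables gdp_toy and recession_toy and all of their values are made up, and it assumes recession holds one row per date on which the economic status changes.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the exercise, values are invented.
gdp_toy = pd.DataFrame({
    "date": pd.to_datetime(["1980-01-01", "1980-04-01", "1980-07-01", "1980-10-01"]),
    "gdp": [100.0, 98.0, 97.5, 99.0],
})
# Assumed layout: each row marks the date on which the status changes.
recession_toy = pd.DataFrame({
    "date": pd.to_datetime(["1980-01-01", "1980-08-01"]),
    "econ_status": ["recession", "normal"],
})

# merge_asof() needs both tables sorted by the key. By default each left row
# takes the most recent right row whose date is on or before its own date.
merged = pd.merge_asof(gdp_toy, recession_toy, on="date")

# One color code per row: 'r' for recession quarters, 'g' otherwise.
colors = ["r" if s == "recession" else "g" for s in merged["econ_status"]]
print(merged)
print(colors)
```

The first three quarters pick up the 'recession' row and the last one picks up the 'normal' row, so colors can be passed straight to the color argument of a bar plot.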

In [1]:
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt.show() calls below

recession = pd.read_csv("/kaggle/input/recession/CPICPIAUCSL.csv")
gdp = pd.read_csv("/kaggle/input/world-gdpgdp-gdp-per-capita-and-annual-growths/gdp.csv")

recession.head()


Unnamed: 0,DATE,CPIAUCSL
0,1947-01-01,21.48
1,1947-02-01,21.62
2,1947-03-01,22.0
3,1947-04-01,22.0
4,1947-05-01,21.95


In [2]:
gdp.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,2534637000.0,2727850000.0,2790849000.0,2962905000.0,2983637000.0,3092430000.0,3202189000.0,,,
1,Africa Eastern and Southern,AFE,19313110000.0,19723490000.0,21493920000.0,25733210000.0,23527440000.0,26810570000.0,29152160000.0,30173170000.0,...,950521400000.0,964242400000.0,984807100000.0,919930000000.0,873354900000.0,985355700000.0,1012853000000.0,1009910000000.0,920792300000.0,
2,Afghanistan,AFG,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,1400000000.0,1673333000.0,...,19907320000.0,20146400000.0,20497130000.0,19134210000.0,18116560000.0,18753470000.0,18053230000.0,18799450000.0,20116140000.0,
3,Africa Western and Central,AFW,10404280000.0,11128050000.0,11943350000.0,12676520000.0,13838580000.0,14862470000.0,15832850000.0,14426430000.0,...,727571400000.0,820787600000.0,864966600000.0,760729700000.0,690543000000.0,683741600000.0,741691600000.0,794572500000.0,784587600000.0,
4,Angola,AGO,,,,,,,,,...,128052900000.0,136709900000.0,145712200000.0,116193600000.0,101123900000.0,122123800000.0,101353200000.0,89417190000.0,58375980000.0,


In [None]:
# Merge gdp and recession on date using merge_asof()
gdp_recession = pd.merge_asof(gdp, recession, on = "date")

# Create a list based on the row value of gdp_recession['econ_status']
is_recession = ['r' if s=='recession' else 'g' for s in gdp_recession['econ_status']]

# Plot a bar chart of gdp_recession
gdp_recession.plot(kind="bar", y="gdp", x="date", color=is_recession, rot=90)
plt.show()

![image.png](attachment:f3b292e4-4025-4566-8bac-5aa28c86e7cb.png)

You can see from the chart that a number of quarters early in the 1980s fell during a recession. merge_asof() allowed you to quickly add a flag to the gdp dataset by matching each quarter to the most recent date in recession on or before it, in one line of code! If you were to perform the same task using subsetting, it would have taken a lot more code.
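
To make the comparison with subsetting concrete, here is a rough sketch of both approaches on invented data. The periods table with start and end columns is a hypothetical layout chosen for illustration, as are all of the values.

```python
import pandas as pd

# Hypothetical toy data; the start/end layout is an assumption for illustration.
gdp_toy = pd.DataFrame({
    "date": pd.to_datetime(["1981-04-01", "1981-07-01", "1982-10-01", "1983-01-01"]),
    "gdp": [100.0, 99.0, 97.0, 98.5],
})
periods = pd.DataFrame({
    "start": pd.to_datetime(["1981-07-01"]),
    "end": pd.to_datetime(["1982-12-01"]),
})

# Subsetting approach: build a boolean mask one recession period at a time.
in_recession = pd.Series(False, index=gdp_toy.index)
for row in periods.itertuples(index=False):
    in_recession |= (gdp_toy["date"] >= row.start) & (gdp_toy["date"] < row.end)
manual_status = in_recession.map({True: "recession", False: "normal"})

# merge_asof() approach: reshape the periods into one status-change row per date.
changes = pd.concat([
    pd.DataFrame({"date": periods["start"], "econ_status": "recession"}),
    pd.DataFrame({"date": periods["end"], "econ_status": "normal"}),
]).sort_values("date")
# Quarters before the first change have no match, so treat them as normal.
asof_status = pd.merge_asof(gdp_toy, changes, on="date")["econ_status"].fillna("normal")

print(list(manual_status))
print(list(asof_status))
```

Both approaches flag the same quarters here. The mask version needs a loop over the periods, while the merge_asof() version is a single call once the table of status changes exists.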

![image.png](attachment:6df0a7a9-a253-4191-a26f-4f109964d680.png)

**Subsetting rows with .query()**

In this exercise, you will revisit GDP and population data for Australia and Sweden from the World Bank and expand on it using the .query() method. You'll merge the two tables and compute the GDP per capita. Afterwards, you'll use the .query() method to sub-select the rows and create a plot. Recall that you will need to merge on multiple columns in the proper order.


* Use merge_ordered() on gdp and pop on columns country and date with forward fill, save to gdp_pop and print.
* Add a column named gdp_per_capita to gdp_pop that divides gdp by pop.
* Pivot gdp_pop so values='gdp_per_capita', index='date', and columns='country', save as gdp_pivot.
* Use .query() to select rows from gdp_pivot where date is greater than or equal to "1991-01-01". Save as recent_gdp_pop.


In [None]:
# Merge gdp and pop on date and country with fill
gdp_pop = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

# Add a column named gdp_per_capita to gdp_pop that divides the gdp by pop
gdp_pop['gdp_per_capita'] = gdp_pop['gdp'] / gdp_pop['pop']

# Pivot gdp_pop so that values are gdp_per_capita, index is date, and columns are country
gdp_pivot = gdp_pop.pivot_table(values='gdp_per_capita', index='date', columns='country')

# Select dates equal to or greater than 1991-01-01
recent_gdp_pop = gdp_pivot.query('date >= "1991-01-01"')

# Plot recent_gdp_pop
recent_gdp_pop.plot(rot=90)
plt.show()

![image.png](attachment:9535c443-542b-4672-8309-a04236362063.png)

Amazing! You can see from the plot that the per capita GDP of Australia passed that of Sweden in 1992. By using the .query() method, you were able to select the appropriate rows easily. The .query() method is easy to read and straightforward.
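
The same pipeline can be traced on a small made-up example. The tables gdp_toy and pop_toy and their numbers are hypothetical. The population for Australia in 1991 is deliberately missing so that the forward fill has something to do.

```python
import pandas as pd

# Hypothetical toy tables; the numbers are invented for illustration.
gdp_toy = pd.DataFrame({
    "country": ["Australia", "Australia", "Sweden", "Sweden"],
    "date": pd.to_datetime(["1990-01-01", "1991-01-01", "1990-01-01", "1991-01-01"]),
    "gdp": [300.0, 330.0, 200.0, 210.0],
})
pop_toy = pd.DataFrame({
    "country": ["Australia", "Sweden", "Sweden"],
    "date": pd.to_datetime(["1990-01-01", "1990-01-01", "1991-01-01"]),
    "pop": [10.0, 8.0, 10.0],
})

# Listing country before date keeps each country's rows together, so the
# missing 1991 population for Australia is filled from Australia's 1990 row.
gdp_pop_toy = pd.merge_ordered(gdp_toy, pop_toy, on=["country", "date"], fill_method="ffill")
gdp_pop_toy["gdp_per_capita"] = gdp_pop_toy["gdp"] / gdp_pop_toy["pop"]

# One row per date, one column per country.
pivot_toy = gdp_pop_toy.pivot_table(values="gdp_per_capita", index="date", columns="country")

# .query() can refer to the index by its name, here date.
recent_toy = pivot_toy.query('date >= "1991-01-01"')
print(recent_toy)
```

Only the 1991 row survives the query, with one per capita value for each country.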

# Using .melt() to reshape government data

The US Bureau of Labor Statistics (BLS) often provides data series in an easy-to-read format: it has a separate column for each month, and each year is a different row. Unfortunately, this wide format makes it difficult to plot this information over time. In this exercise, you will reshape a table of US unemployment rate data from the BLS into a form you can plot using .melt(). You will need to add a date column to the table and sort by it to plot the data correctly.

The unemployment rate data has been loaded for you in a table called ur_wide. You are encouraged to explore this table before beginning the exercise.


* Use .melt() to unpivot all of the columns of ur_wide except year and ensure that the columns with the months and values are named month and unempl_rate, respectively. Save the result as ur_tall.
* Add a column to ur_tall named date that joins the year and month columns into a single year-month string and converts it to a date data type.
* Sort ur_tall by date and save as ur_sorted.
* Using ur_sorted, plot unempl_rate on the y-axis and date on the x-axis.

In [None]:
# Unpivot everything besides the year column
ur_tall = ur_wide.melt(id_vars=['year'], var_name='month', value_name='unempl_rate')

# Create a date column from the year and month columns of ur_tall
# (astype(str) guards against year being stored as an integer)
ur_tall['date'] = pd.to_datetime(ur_tall['year'].astype(str) + '-' + ur_tall['month'])

# Sort ur_tall by date in ascending order
ur_sorted = ur_tall.sort_values('date')

# Plot the unempl_rate by date
ur_sorted.plot(x='date', y='unempl_rate')
plt.show()

![image.png](attachment:46b02731-783e-4a37-b853-4d6282b25a0b.png)

Nice going! The plot shows a steady decrease in the unemployment rate with an increase near the end. This increase is likely the effect of the COVID-19 pandemic and its impact on shutting down most of the US economy. In general, data is often provided (_especially by governments_) in a format that is easily read by people but not by machines. The .melt() method is a handy tool for reshaping data into a useful form.
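
As a small illustration of the reshaping, the sketch below melts a two-year, two-month table. The table ur_wide_toy and its values are invented and are not BLS figures. The month column names 'Jan' and 'Feb' are an assumption made so that an explicit date format can be given.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month.
ur_wide_toy = pd.DataFrame({
    "year": [2019, 2020],
    "Jan": [4.0, 3.6],
    "Feb": [3.8, 3.5],
})

# Unpivot the month columns into (month, unempl_rate) pairs.
ur_tall_toy = ur_wide_toy.melt(id_vars=["year"], var_name="month", value_name="unempl_rate")

# Build a year-month string and parse it; year is an integer here, so cast it first.
ur_tall_toy["date"] = pd.to_datetime(
    ur_tall_toy["year"].astype(str) + "-" + ur_tall_toy["month"], format="%Y-%b"
)

# melt() stacks one month column after another, so sort to get chronological order.
ur_sorted_toy = ur_tall_toy.sort_values("date").reset_index(drop=True)
print(ur_sorted_toy)
```

Without the sort, the rows would run January 2019, January 2020, February 2019, February 2020, and a line plot would zigzag back and forth in time.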

**Using .melt() for stocks vs bond performance**

It is commonly held that the price of bonds is inversely related to the price of stocks. In this last exercise, you'll review many of the topics in this chapter to check this. You have been given a table of percent change of the US 10-year treasury bond price. It is in a wide format where there is a separate column for each year. You will need to use the .melt() method to reshape this table.

Additionally, you will use the .query() method to filter out unneeded data. You will merge this table with a table of the percent change of the Dow Jones Industrial stock index price. Finally, you will plot the data.

The tables ten_yr and dji have been loaded for you.



* Use .melt() on ten_yr to unpivot everything except the metric column, setting var_name='date' and value_name='close'. Save the result to bond_perc.
* Using the .query() method, select only those rows where metric equals 'close', and save to bond_perc_close.
* Use merge_ordered() to merge dji (left table) and bond_perc_close on date with an inner join, and set suffixes equal to ('_dow', '_bond'). Save the result to dow_bond.
* Using dow_bond, plot only the Dow and bond values.

In [None]:
# Use melt on ten_yr, unpivot everything besides the metric column
bond_perc = ten_yr.melt(id_vars=['metric'], var_name='date', value_name='close')

# Use query on bond_perc to select only the rows where metric=close
bond_perc_close = bond_perc.query('metric == "close"')

# Merge (ordered) dji and bond_perc_close on date with an inner join
dow_bond = pd.merge_ordered(dji, bond_perc_close, how='inner', on='date',
                            suffixes=('_dow', '_bond'))

# Plot only the close_dow and close_bond columns
dow_bond.plot(y=['close_dow', 'close_bond'], x='date', rot=90)
plt.show()

![image.png](attachment:ff667002-f84d-4570-94e7-8c8764d58717.png)

Super job! You used many of the techniques we have reviewed in this chapter to produce the plot. The plot is consistent with bond and stock prices being inversely related over this period: often, as the price of stocks increases, the price of bonds decreases.
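
The three steps can be followed on a tiny example. The tables ten_yr_toy and dji_toy and all values are made up, and the date labels are kept as plain strings on both sides so that the merge keys match.

```python
import pandas as pd

# Hypothetical wide bond table: one row per metric, one column per date.
ten_yr_toy = pd.DataFrame({
    "metric": ["open", "close"],
    "2007-02-01": [0.03, -0.05],
    "2007-03-01": [0.01, 0.02],
})
# Hypothetical stock table with one extra date that has no bond data.
dji_toy = pd.DataFrame({
    "date": ["2007-02-01", "2007-03-01", "2007-04-01"],
    "close": [0.005, -0.026, 0.048],
})

# Wide to long: the former column labels become values in the date column.
bond_toy = ten_yr_toy.melt(id_vars=["metric"], var_name="date", value_name="close")

# Keep only the closing values.
bond_close_toy = bond_toy.query('metric == "close"')

# The inner join drops 2007-04-01; the suffixes separate the two close columns.
dow_bond_toy = pd.merge_ordered(
    dji_toy, bond_close_toy, on="date", how="inner", suffixes=("_dow", "_bond")
)
print(dow_bond_toy)
```

Because both tables have a column named close, the suffixes argument is what makes it possible to plot close_dow and close_bond side by side.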