# SI 330: Data Manipulation 
## 04 - Joining, Combining, and Reshaping

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

## Learning Objectives
* use pd.read_html's parameters to extract specific tables from web pages
* create dataframes from lists and dictionaries
* use Pandas' apply function to run a function on each row of a dataframe
* view and set the indexes of a dataframe, including hierarchical indexes
* use loc to explore a dataframe with hierarchical indexes
* use stack and unstack to reshape dataframes
* concatenate two DataFrames by columns
* rename a dataframe's columns with a dictionary
* use Pandas' merge function to join dataframes in a SQL-like way

This lab was inspired by https://pythonhealthcare.org/2018/04/08/32-reshaping-pandas-data-with-stack-unstack-pivot-and-melt/

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = '?'

## Before we start...
### <font color="magenta">Q1: (1 point) Please let us know what you found confusing in the last class. </font>
We'll try to take time in the next class to review these concepts.


Replace this with your response.

## Review from last class

Recall from last class the ```read_html``` function, which made extracting tables from HTML pages a lot easier than using
BeautifulSoup (under the hood it relies on a parser such as lxml or bs4, but hides the ugly details).  Let's warm up for today's class by extracting some information from
a number of Wikipedia pages.

Our top-level goal is to extract information about the _aliases_ of some Lord Of The Rings characters.  Take a look at the Wikipedia page
for [Frodo Baggins](https://en.wikipedia.org/wiki/Frodo_Baggins) to get an idea of the sort of pages we're looking at.

In [None]:
import pandas as pd

In [None]:
frodo_url = 'https://en.wikipedia.org/wiki/Frodo_Baggins'

In [None]:
frodo_tables = pd.read_html(frodo_url)

In [None]:
frodo_tables[0]

Now let's load the page for [Legolas](https://en.wikipedia.org/wiki/Legolas):

In [None]:
legolas_url = 'https://en.wikipedia.org/wiki/Legolas'
legolas_tables = pd.read_html(legolas_url)

In [None]:
legolas_tables[0]

Hmmmm.  That doesn't look quite right.

Let's take a look at some URLs and figure out what's going on:

### <font color="magenta">Q2: (1 point) Inspect the Frodo and Legolas pages and see if you can figure out some _attributes_ of the table we're interested in.  </font>


Describe what you found.

You'll notice that the "Information" box shares some characteristics across pages.  We can leverage that
information by using the ```attrs``` parameter of ```read_html```.  For example, if we wanted to extract the element(s) that had
an ```id``` of ```info```, we could use

```pd.read_html(url, attrs={'id':'info'})```



### <font color="magenta">Q3: (1 point) Fill in the following code block to extract only the "Information" table for the Legolas page:</font>

In [None]:
a = {} # create an appropriate dictionary
pd.read_html(legolas_url, attrs=a)

Now let's define a function that, given a Wikipedia URL, will extract the contents of the Aliases component of the infobox table:

In [None]:
def get_aliases(url):
    tables = pd.read_html(url, attrs={'class':'infobox'}) # extract only tables with class=infobox
    print(url,len(tables))   # sanity check: we should have just 1 table
    infotable = tables[0]    # pull the first table into a DataFrame
    try:
        # setting the index on column 0 lets us use .loc to look up the 'Aliases' row
        x = infotable.set_index(0).loc['Aliases']
        ret = x.values[0]
    except (KeyError, IndexError):   # the infobox has no 'Aliases' row
        ret = 'None'         # note: this is the string 'None', not the None object
    return ret

And let's try it out:

In [None]:
get_aliases(legolas_url)
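
The lookup inside ```get_aliases``` can also be tried without a network connection. The following is a minimal sketch, assuming a hand-built table whose column 0 holds labels and whose column 1 holds values; the helper name ```lookup``` and the table contents are invented for illustration and are not taken from Wikipedia.

```python
import pandas as pd

# Hypothetical stand-in for a parsed infobox: labels in column 0, values in column 1.
infotable = pd.DataFrame({0: ['Race', 'Aliases', 'Gender'],
                          1: ['Elf', 'Greenleaf', 'Male']})

def lookup(table, label):
    """Return the value stored next to `label`, or the string 'None' if it is absent."""
    try:
        # With column 0 as the index, .loc[label] returns that row as a Series.
        return table.set_index(0).loc[label].values[0]
    except KeyError:
        return 'None'

alias = lookup(infotable, 'Aliases')    # label is present
missing = lookup(infotable, 'Weapon')   # label is absent, so we get the fallback
print(alias, missing)
```

A missing label raises a ```KeyError``` inside ```.loc```, which is why the fallback value is returned in the second call.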

So far, so good.  It seems to work.  Now let's set up a DataFrame with a bunch of URLs:

In [None]:
urls = ['https://en.wikipedia.org/wiki/Gimli_(Middle-earth)',
        'https://en.wikipedia.org/wiki/Frodo_Baggins',
        'https://en.wikipedia.org/wiki/Legolas',
        'https://en.wikipedia.org/wiki/Bilbo_Baggins',
        'https://en.wikipedia.org/wiki/Samwise_Gamgee',
        'https://en.wikipedia.org/wiki/Peregrin_Took',
        'https://en.wikipedia.org/wiki/Boromir',
        'https://en.wikipedia.org/wiki/Galadriel',
        'https://en.wikipedia.org/wiki/Meriadoc_Brandybuck']
names = ['Gimli',
         'Frodo',
         'Legolas',
         'Bilbo',
         'Sam',
         'Pippin',
         'Boromir',
         'Galadriel',
         'Meriadoc']

In [None]:
udf = pd.DataFrame()
udf['name'] = names
udf['url'] = urls

In [None]:
udf

The pythonic way of iterating through each of those rows would involve the use of some sort of ```for``` loop.  In pandas,
however, we can use the ```apply``` function to process an entire column!

In [None]:
udf['url'].apply(get_aliases)

We can take the resulting Series and assign it to a new column in our DataFrame:

In [None]:
udf['aliases'] = udf['url'].apply(get_aliases)

In [None]:
udf
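
Here is a small offline sketch of ```apply```, first on a single column and then on whole rows with ```axis=1```. The helper ```page_title``` and the two-row DataFrame are assumptions made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Frodo', 'Sam'],
                    'url': ['https://en.wikipedia.org/wiki/Frodo_Baggins',
                            'https://en.wikipedia.org/wiki/Samwise_Gamgee']})

def page_title(url):
    """Hypothetical helper: keep only the part of the URL after the last slash."""
    return url.rsplit('/', 1)[-1]

# Series.apply calls the function once per value in the column.
toy['title'] = toy['url'].apply(page_title)

# DataFrame.apply with axis=1 calls the function once per row.
toy['label'] = toy.apply(lambda row: f"{row['name']}: {row['title']}", axis=1)
print(toy[['title', 'label']])
```

Column-wise ```apply``` is the simpler choice when the function needs only one value; ```axis=1``` is useful when it needs several columns from the same row.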

Let's just put the ```udf``` DataFrame aside for now.  We'll return to it later.

## Creating DataFrames and Exploring Indexes

Let's load the usual libraries...

In [None]:
import pandas as pd
import numpy as np

Let's create some lists of data that we can use to construct a DataFrame:

In [None]:
names = ['Gandalf',
         'Gimli',
         'Frodo',
         'Legolas',
         'Bilbo',
         'Sam',
         'Pippin',
         'Boromir',
         'Aragorn',
         'Galadriel',
         'Meriadoc',
        'Lily']
races = ['Maia',
         'Dwarf',
         'Hobbit',
         'Elf',
         'Hobbit',
         'Hobbit',
         'Hobbit',
         'Man',
         'Man',
         'Elf',
         'Hobbit',
        'Hobbit']
magic = [10, 1, 4, 6, 4, 2, 0, 0, 2, 9, 0, np.nan]
aggression = [7, 10, 2, 5, 1, 6, 3, 8, 7, 2, 4, np.nan]
stealth = [8, 2, 5, 10, 5, 4, 5, 3, 9, 10, 6, np.nan]

There are a few different ways to construct a DataFrame.  One is to start with an empty DataFrame and assign each list to a column, as we did for ```udf``` above:

### <font color="magenta"> Q4: (2 points) Construct a dataframe with 5 columns (names, races, magic, aggression, and stealth) using the lists above.</font>

In [None]:
df = # Insert your code here

In [None]:
df

Alternatively, we could have set things up with a dict:

In [None]:
df = pd.DataFrame({'name': names,'race':races,'magic':magic,'aggression': aggression,'stealth':stealth})

In [None]:
df

Let's take a look at the index on the resulting DataFrame:

In [None]:
df.index

We can set the index to something more useful than the default RangeIndex:

In [None]:
df_nameindexed = df.set_index('name')

And if we take a look at the results, we see that we have a pandas Index instead of a RangeIndex:

In [None]:
df_nameindexed.index

In [None]:
df_nameindexed

Setting the name Series as the index allows us to do things like:

In [None]:
df_nameindexed.loc['Aragorn']
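
As a quick check of what ```set_index``` does and does not change, here is a minimal sketch on a two-row DataFrame (the data are made up for illustration). It suggests that ```set_index``` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aragorn', 'Gimli'], 'magic': [2, 1]})

default_index = toy.index          # the default index is a RangeIndex
by_name = toy.set_index('name')    # returns a new DataFrame; toy is unchanged

# With names as the index, .loc can take a row label and a column label.
aragorn_magic = by_name.loc['Aragorn', 'magic']
print(type(default_index).__name__, type(by_name.index).__name__, aragorn_magic)
```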

Now recall hierarchical indexing from the readings.  We can pass a list of column names to ```set_index``` to create a hierarchical index (a MultiIndex):

In [None]:
df_racename_indexed = df.set_index(['race','name'])

In [None]:
df_racename_indexed.index

This will allow us to get a DataFrame that matches a value on the outer index:

In [None]:
df_racename_indexed.loc['Hobbit']

We can also use the index on a Series to match the outer index:

In [None]:
df_racename_indexed['magic'].loc['Hobbit']

Or both indexes:

In [None]:
df_racename_indexed['magic'].loc['Hobbit','Frodo']

Or just the inner index:

In [None]:
df_racename_indexed['magic'].loc[:,'Frodo']
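
The same lookups can be written directly against the DataFrame rather than a single column. The sketch below uses a small made-up table; it shows a tuple passed to ```.loc``` for both index levels and ```xs``` as one alternative way to select on the inner level.

```python
import pandas as pd

toy = pd.DataFrame({'race': ['Hobbit', 'Hobbit', 'Man'],
                    'name': ['Frodo', 'Sam', 'Boromir'],
                    'magic': [4, 2, 0],
                    'stealth': [5, 4, 3]}).set_index(['race', 'name'])

# Outer level only: the result is a DataFrame indexed by name.
hobbits = toy.loc['Hobbit']

# Both levels as a tuple, plus a column label: the result is a single value.
sam_stealth = toy.loc[('Hobbit', 'Sam'), 'stealth']

# Inner level only, using xs: the result keeps the outer level as its index.
boromir = toy.xs('Boromir', level='name')
print(hobbits, sam_stealth, boromir, sep='\n')
```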

### <font color="magenta"> Q5: (1 point) Using .loc, find how much aggression Legolas, an Elf, has.</font>

In [None]:
# Insert your code here

## Stacking and Unstacking

Stacking takes "wide" data and makes it "taller" by moving the column labels into an inner level of the index:

In [None]:
df.set_index(['race']).stack()

If we call reset_index on the resulting Series, we get the following DataFrame:

In [None]:
df.set_index(['race']).stack().reset_index()

The column names in the above DataFrame aren't particularly helpful, so we can rename them:

In [None]:
df.set_index(['race']).stack().reset_index().rename(columns = {'level_1':'variable',0:'value'})

You can do the opposite of stacking by using the ```unstack``` function:

In [None]:
df_stacked = df.stack()

In [None]:
df_stacked

In [None]:
df_stacked.unstack()

Why would we want to stack or unstack?  It depends on what sorts of analyses we want to do "downstream".  It's also the basis for pivoting, melting, and pivot tables, which we'll cover in the next class.
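
To make the round trip concrete, here is a minimal sketch on a three-row table with made-up values and no missing data. It stacks the table, tidies the result into named columns, and then unstacks it again.

```python
import pandas as pd

wide = pd.DataFrame({'name': ['Bilbo', 'Frodo', 'Gimli'],
                     'magic': [4, 4, 1],
                     'stealth': [5, 5, 2]}).set_index('name')

# stack moves the column labels into an inner index level: 3 rows x 2 columns -> 6 values.
tall = wide.stack()

# reset_index turns both index levels into columns; the unnamed ones get default labels.
tidy = tall.reset_index().rename(columns={'level_1': 'variable', 0: 'value'})

# unstack moves the inner index level back out to the columns.
wide_again = tall.unstack()
print(tidy)
```

Each row of ```tidy``` holds one (name, variable, value) triple, which is the "tall" layout that many plotting and grouping tools expect.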

## Joining Data



Let's say we have another CSV file that contains URLs to Wikipedia pages for some of the LOTR characters:

In [None]:
urls = pd.read_csv('data/lotr_wikipedia.csv')

In [None]:
urls

Let's take a look at the original DataFrame:

In [None]:
df

It looks like the rows are "aligned", so we can use the ```concat``` function to concatenate the two DataFrames.
Note that we specify the axis to be the columns.  The default is to concatenate by rows, which isn't what we want.

In [None]:
pd.concat([df,urls],axis="columns")

That's great, and it's consistent with what we've used in previous classes.  But what happens if the 
rows in the two DataFrames don't match up?  Let's load another file that has a slightly different
sequence of rows:

### <font color="magenta"> Q6: (1 point) Construct a dataframe from lotr_wikipedia_wrong_order.csv, which is in the data folder.</font>

In [None]:
urls_wrong_order = # Insert your code here

In [None]:
urls_wrong_order

In [None]:
pd.concat([df,urls_wrong_order],axis="columns")

Take a closer look at the name and url columns.  Something's not quite right.
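
The reason is that ```concat``` pairs rows by their index labels, not by the values in any column. A minimal sketch with two made-up two-row DataFrames (the url values are placeholders, not real links) illustrates the mismatch and one way around it:

```python
import pandas as pd

stats = pd.DataFrame({'name': ['Frodo', 'Sam'], 'magic': [4, 2]})
links = pd.DataFrame({'name': ['Sam', 'Frodo'], 'url': ['sam-page', 'frodo-page']})

# Both frames have the default index 0, 1, so row 0 is paired with row 0
# even though the names differ.
side_by_side = pd.concat([stats, links], axis='columns')

# With name as the index, concat has something meaningful to align on.
aligned = pd.concat([stats.set_index('name'), links.set_index('name')], axis='columns')
print(side_by_side)
print(aligned)
```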

We can work around that either by setting an appropriate index and using ```join```, or by using the SQL-like ```merge``` function on a shared column.

In [None]:
df_names = df.set_index('name')

In [None]:
df_names

In [None]:
urls_wrong_order_names = urls_wrong_order.set_index('name')

In [None]:
df_names.join(urls_wrong_order_names)

In [None]:
df.head()

In [None]:
urls_wrong_order.head()

In [None]:
urls_wrong_order['name']

In [None]:
df['name']

In [None]:
df.merge(urls_wrong_order,on='name')
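
The two approaches can be compared side by side. This is a sketch on made-up data with placeholder url values: ```join``` matches on the index, while ```merge``` matches on a named column.

```python
import pandas as pd

stats = pd.DataFrame({'name': ['Frodo', 'Sam'], 'magic': [4, 2]})
links = pd.DataFrame({'name': ['Sam', 'Frodo'], 'url': ['sam-page', 'frodo-page']})

# join: both sides are indexed by name first, and the result keeps name as its index.
joined = stats.set_index('name').join(links.set_index('name'))

# merge: name stays an ordinary column on both sides and in the result.
merged = stats.merge(links, on='name')
print(joined)
print(merged)
```

Which one to use is largely a matter of whether the key already lives in the index or in a column.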

Now let's add a few additional URLs:

In [None]:
urls_extras = pd.read_csv("data/lotr_wikipedia_extras.csv")

In [None]:
urls_extras

And now let's use concat to add the new entries to the DataFrame.

In [None]:
urls_complete = pd.concat([urls,urls_extras])

In [None]:
urls_complete
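
When concatenating by rows, each DataFrame keeps its own index labels, so the result can contain repeated labels. The sketch below uses made-up data and shows the ```ignore_index=True``` option, which renumbers the rows instead.

```python
import pandas as pd

first = pd.DataFrame({'name': ['Frodo', 'Sam'], 'url': ['frodo-page', 'sam-page']})
extra = pd.DataFrame({'name': ['Elrond'], 'url': ['elrond-page']})

stacked = pd.concat([first, extra])                          # index labels are kept as-is
renumbered = pd.concat([first, extra], ignore_index=True)    # index is rebuilt from 0
print(list(stacked.index), list(renumbered.index))
```

Repeated index labels do not affect a ```merge``` on a column, but they can make ```.loc``` lookups return more than one row.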

Now that we've got a complete (for our purposes) list of URLs, let's use that DataFrame and our original
one to demonstrate the different types of ```join```s.

By default, ```join``` uses a left join (whereas ```merge``` defaults to an inner join).  A left join means that all the values from the "left"
side are used, whether or not there's a corresponding entry from the "right" side.  In the example
below, note that the url value for "Lily" is "NaN":

In [None]:
df.merge(urls_complete,on='name',how='left')

The "opposite" of a left join is, perhaps unsurprisingly, a "right" join, in which
all the values from the "right" side are used, whether or not a corresponding
value from the "left" side exists. Note in the following example that "Lily" has
disappeared, and Treebeard and Elrond lack information about "race", "magic", "aggression", and "stealth".

In [None]:
df.merge(urls_complete,on='name',how='right')

In addition to "left" and "right" joins, we have "outer" joins, which include
values from both the "left" and "right" DataFrames, regardless of whether
there are corresponding values in the other DataFrame.  Note that all of 
"Lily", "Treebeard" and "Elrond" are present in the following DataFrame:

In [None]:
df.merge(urls_complete,on='name',how='outer')

Finally, there are "inner" joins, which include only those values that exist in both the "left" and "right" DataFrames:

In [None]:
df.merge(urls_complete,on='name',how='inner')
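
The four join types are easiest to compare on a tiny example. In this sketch (made-up data, placeholder url values), "Lily" appears only on the left and "Elrond" only on the right, so the set of names in the result shows which rows each ```how``` keeps.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Frodo', 'Lily'], 'magic': [4, 0]})
right = pd.DataFrame({'name': ['Frodo', 'Elrond'], 'url': ['frodo-page', 'elrond-page']})

# Collect the (sorted) names that survive each kind of join.
names_by_how = {how: sorted(left.merge(right, on='name', how=how)['name'])
                for how in ['left', 'right', 'outer', 'inner']}

# With no how argument, merge behaves like an inner join.
default_rows = len(left.merge(right, on='name'))

for how, names in names_by_how.items():
    print(how, names)
```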

Sometimes it's nice to know how a particular row got added to the resulting DataFrame.  Using ```indicator=True```
allows us to examine this:

In [None]:
df.merge(urls_complete,how='outer',indicator=True)

You'll note that we used the ```merge``` function from the DataFrame and passed in the other DataFrame as an argument.
You can also call the ```merge``` function from pandas directly and pass it the two DataFrames you are merging:

In [None]:
pd.merge(df,urls_complete,how='outer',indicator=True)
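
The ```_merge``` column that ```indicator=True``` adds takes one of three values: ```left_only```, ```right_only```, or ```both```. A minimal sketch on made-up data (placeholder url values) shows one of each:

```python
import pandas as pd

left = pd.DataFrame({'name': ['Frodo', 'Lily'], 'magic': [4, 0]})
right = pd.DataFrame({'name': ['Frodo', 'Elrond'], 'url': ['frodo-page', 'elrond-page']})

merged = left.merge(right, on='name', how='outer', indicator=True)

# Map each name to the side(s) it came from; _merge is categorical, so convert to str.
origin = merged.set_index('name')['_merge'].astype(str).to_dict()
print(origin)
```

Filtering on ```_merge``` is a convenient way to find the rows that failed to match before deciding which kind of join to keep.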

### <font color="magenta">Q7: (3 points) Join the ```udf``` DataFrame (that contains aliases) to the ```df``` DataFrame using an appropriate merge.</font>

In [None]:
# Insert your code here

# END OF NOTEBOOK
Please remember to submit your notebook in .ipynb and .html formats.