<a href="https://colab.research.google.com/github/taegeonyu/hds5210-2023/blob/main/week14/module56-merging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Merging DataFrames Together

In this module, we're going to talk about two different types of merging: concatenation and masking.

In [None]:
!pip install awscli

In [None]:
import pandas as pd

## Concatenation

To "concatenate" means to combine things end-to-end. Here, that means merging multiple data sets by appending the rows of each one after the rows of the one before it.
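
As a minimal sketch of that idea, here's a toy example with made-up data (the frames `a` and `b`, and their `city` and `value` columns, are hypothetical and not part of the course data):

```python
import pandas as pd

# Two small, made-up frames that share the same columns
a = pd.DataFrame({'city': ['St. Louis', 'St. Louis'], 'value': [1, 2]})
b = pd.DataFrame({'city': ['Chicago'], 'value': [3]})

# Stack the rows of b underneath the rows of a
stacked = pd.concat([a, b])

print(stacked)
print(len(stacked) == len(a) + len(b))
```

The combined frame has one row for every row in the inputs, which is the same check we'll run on the real data further down.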

In `https://hds5210-data.s3.amazonaws.com/drinking/` there is a whole list of files that we want to merge together into a single data frame. They all have the same format, but they are from different cities.

In [None]:
%%bash

# *** WARNING ***
# Do not run this code on your local machine if you have the AWS CLI already configured
# It could cause a problem with your existing security credentials
# and permanently erase your existing access keys and secrets

# If you're curious about what this code does,
# it creates a file called ~/.aws/credentials with credentials I've created
# that allow you to list files in a particular AWS s3 storage bucket.

mkdir -p ~/.aws
grep hds5210 ~/.aws/credentials 2>/dev/null || cat >>~/.aws/credentials <<EOF
[hds5210]
aws_access_key_id = AKIAUXBOKEFK63ZPGD62
aws_secret_access_key = ***
aws_default_region = us-east-1
EOF
chmod 600 ~/.aws/credentials
cat ~/.aws/credentials

In [None]:
# Then this one-liner gets a list of the files in a specific storage
# bucket subfolder and writes that list of files to a files.txt file.
# After you run this code, you should see a file in Google Colab
# with this same name.  From there, we'll use Python code to get the files.
!aws --profile hds5210 s3 ls s3://hds5210-data/drinking/ >files.txt

In [None]:
# Here's a function we'll use to read all of the file names from that
# text file that the aws command above created.
# The command above outputs in a "human readable" format that we have to parse
# making some assumptions (like file names won't have spaces in them).  It
# only works because this specific subfolder doesn't have any files with spaces
# in the name.

def get_files(listing_file):
  files = []

  # Open the listing file
  with open(listing_file) as f:
    for line in f:
      # Split based on space, grab the last item, strip off extra newline
      name = line.split(' ')[-1].strip()
      # The aws command returns an empty-name file as well for some reason
      # So, we'll strip that out
      if len(name) > 0:
        files.append(name)

  # Return the list of files
  return files
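
To see what that parsing does, here is a small sketch that applies the same split-and-strip logic to a made-up listing. The sample lines and file names below are hypothetical; they only imitate the date, time, size, name layout that `aws s3 ls` prints.

```python
# Made-up listing text in the "date time size name" layout.
# The first line imitates the empty-name entry mentioned above.
sample_listing = (
    "2023-01-01 00:00:00          0 \n"
    "2023-01-01 00:00:01       1234 city_a.csv\n"
    "2023-01-01 00:00:02       5678 city_b.csv\n"
)

names = []
for line in sample_listing.splitlines():
    # Same approach as get_files: take the last space-separated piece
    name = line.split(' ')[-1].strip()
    # Skip the entry whose name is empty
    if name:
        names.append(name)

print(names)
```

Because the name is taken as the last space-separated piece, a file name containing a space would be cut short, which is why this approach only suits folders like this one.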

In [None]:
files = get_files('files.txt')

In [None]:
files

In [None]:
len(files)

In [None]:
# Then, let's read each of those files into their own df and store that in a list of dfs
dataframes = []

In [None]:
for f in files:
    df = pd.read_csv('https://hds5210-data.s3.amazonaws.com/drinking/'+f)
    print(f'Read {f}')
    dataframes.append(df)

In [None]:
len(dataframes)

In [None]:
type(dataframes[0])

In [None]:
dataframes[0].head()

In [None]:
# Then we can concatenate them together with pd.concat
drinking = pd.concat(dataframes)

Let's check to make sure the counts match up: is the length of the combined dataframe equal to the sum of the lengths of the individual dataframes?

In [None]:
len(drinking)

In [None]:
sum([len(x) for x in dataframes])

In [None]:
drinking.head()
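
One thing worth noticing in the combined frame: by default, `pd.concat` keeps each input's original row labels, so the combined index can contain repeats. The sketch below uses hypothetical toy frames (not the drinking data) to show that, and how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two made-up frames, each with the default 0..n-1 row labels
a = pd.DataFrame({'value': [10, 20]})
b = pd.DataFrame({'value': [30, 40]})

# Default behavior: the original row labels are kept, so they repeat
kept = pd.concat([a, b])
print(list(kept.index))

# ignore_index=True throws away the old labels and renumbers from 0
renumbered = pd.concat([a, b], ignore_index=True)
print(list(renumbered.index))
```

Whether to renumber depends on whether the original row labels carry any meaning you want to keep.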

It's also possible to label the rows as they get concatenated together by passing the `keys` argument. Each key becomes the outer level of the resulting row index, which can be handy if you want to keep track of which input file each row came from.

In [None]:
drinking2 = pd.concat(dataframes, keys=files)

In [None]:
drinking2.head()

In [None]:
drinking2.head().reset_index(names=['File','Number'])

In [None]:
drinking2.index.levels[0]
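
Here's a small sketch of how those labels can be used afterwards, assuming two made-up frames and hypothetical file names `a.csv` and `b.csv`:

```python
import pandas as pd

# Made-up frames standing in for two input files
a = pd.DataFrame({'value': [1, 2]})
b = pd.DataFrame({'value': [3]})

# keys= adds an outer index level holding the label for each input
labeled = pd.concat([a, b], keys=['a.csv', 'b.csv'])

# The outer level tells us which input each row came from
sources = list(labeled.index.get_level_values(0))
print(sources)

# Selecting on the outer level pulls back just the rows from one input
only_b = labeled.loc['b.csv']
print(only_b)
```

Selecting with `.loc` on the outer label is one way to get back to the rows of a single file after everything has been stacked.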

## Concatenating Side-by-Side

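The stacking example above is more common, but you can also concatenate data side-by-side by passing `axis=1`. In that case the rows are lined up by their index labels, and any row that is missing from one of the data frames is filled with NaN.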

In [None]:
names1=[['Paul','Boal'],['Anny', 'Monroe'],['Eric','Westhus'],['Andy','Slavitt']]
names2=[['Paul Boal'],['Anny Monroe'],['Eric Westhus'],[''],['Mario Garza']]
n1 = pd.DataFrame(names1, columns=['First','Last'])
n2 = pd.DataFrame(names2, columns=['Full Name'])

In [None]:
n1

In [None]:
n2

In [None]:
pd.concat([n1,n2], axis=1)
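
The alignment happens on the row labels, not on row position. As a rough sketch with hypothetical toy frames whose indexes only partly overlap:

```python
import pandas as pd

# Made-up frames with different row labels: 0,1 versus 1,2
left = pd.DataFrame({'First': ['Ann', 'Bob']}, index=[0, 1])
right = pd.DataFrame({'Full Name': ['Bob B', 'Cy C']}, index=[1, 2])

# axis=1 places the columns next to each other, matching rows by label
side = pd.concat([left, right], axis=1)
print(side)
```

Only label 1 exists in both frames, so that is the only row with values in both columns; labels 0 and 2 get NaN on the side where they are missing.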

## Masking

With "masking", we are taking two data sets and overlaying one on top of the other. The two are lined up by their index and columns. Where the first has a value, that value is kept. Where the first has a blank (NaN), the underlying value from the second data set shows through.
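
As a minimal sketch of that behavior using pandas' `combine_first`, with made-up data (the frames `top` and `bottom` and their values are hypothetical, not the NPPES files):

```python
import numpy as np
import pandas as pd

# The "top" frame has a blank State for row 102
top = pd.DataFrame({'State': ['MO', np.nan]}, index=[101, 102])

# The "bottom" frame has values for those rows, plus an extra row 103
bottom = pd.DataFrame({'State': ['IL', 'KS', 'CA']}, index=[101, 102, 103])

# Keep top's values where present; fall back to bottom's where top is blank
merged = top.combine_first(bottom)
print(merged)
```

Row 101 keeps `MO` from the top frame, row 102 picks up `KS` from underneath, and row 103 appears in the result even though only the bottom frame has it.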

In [None]:
nppes1 = pd.read_csv('https://hds5210-data.s3.amazonaws.com/nppes1.csv')
nppes2 = pd.read_csv('https://hds5210-data.s3.amazonaws.com/nppes2.csv')
nppes1.set_index('NPI', inplace=True)
nppes2.set_index('NPI', inplace=True)

In [None]:
nppes2.head()

In [None]:
nppes1['State'].count()

In [None]:
len(nppes1)

In [None]:
len(nppes2)

In [None]:
nppes2

In [None]:
nppes1[pd.isnull(nppes1['State'])]

In [None]:
combined = nppes1.combine_first(nppes2)

In [None]:
combined['State'].count()

In [None]:
len(nppes1)

In [None]:
combined.loc[1225590060]

In [None]:
nppes1.loc[1225590060]