# 10 Data assembly
File(s) needed: concat_1.csv, concat_2.csv, concat_3.csv, survey_person.csv, survey_site.csv, survey_survey.csv, survey_visited.csv

We have covered the basics of loading and inspecting data with pandas. Now, we begin further preparing data for analysis. This notebook covers techniques to combine multiple datasets into one dataset.

We begin with **concatenation**, which will allow us to add rows and columns. Later, we will use a **merge** operation to combine datasets in a way that might look familiar.

# What is tidy data?
_Tidy data_ is data organized so that it is easy to work with. The idea is championed by a prominent member of the R community, but it applies to any data. Tidy data meets these three criteria:
- each row is one observation
- each column is one variable
- each table is one observational unit

We will spend more time talking about tidy data later, but this is a good place to start that conversation. You may notice that these three criteria are related to something you may have seen in a previous class: database normalization.

## Why does tidy data matter to us now?
If data is tidy, it will often need to be combined to answer a particular question of interest to us. For example, if we are analyzing company performance data, the tidy version may be split into multiple tables, like one for basic company information and another for historical stock prices. If we are going to include a comparison to industry or market performance, that data will be located in another table or tables. The historical stock prices may also be split into separate files based upon year, decade, or some other subdivision. Or we may just need to combine two types of data, like combining longitude and latitude with zip codes.

# Concatenation
If the data is split into multiple parts or you want to append data to an existing dataset, **concatenation** is a relatively easy way to combine them.

We use the pandas function `concat` to do this. Let's look at an example using three sample datasets. First, we'll read the data into dataframes and see what is there. 

In [1]:
# As always...
import pandas as pd

In [3]:
# Read the data for the first set of examples
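# raw strings (r"...") keep the backslashes in Windows paths from being read as escape sequences
df1=pd.read_csv(r"..\MIS-3335\data\concat_1.csv")
df2=pd.read_csv(r"..\MIS-3335\data\concat_2.csv")
df3=pd.read_csv(r"..\MIS-3335\data\concat_3.csv")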

In [6]:
# what is in the first dataframe?
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [5]:
# what is in the second dataframe?
df2

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [4]:
# what is in the third dataframe?
df3

Unnamed: 0,A,B,C,D
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


## Adding rows
We previously saw an example of one way to add rows to a dataframe. Using the `concat` function is simpler when the rows to combine are similarly constructed dataframes. The names of the dataframes to be combined are passed to `concat` in a list.

In [7]:
# concatenate the three dataframes
row_concat=pd.concat([df1,df2,df3])
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9


In [None]:
# subset using loc to get the rows with index name of 0


In [None]:
# subset using iloc to get the rows at position 0


In [None]:
# Do not use ix. It was deprecated and then removed in pandas 1.0; use loc or iloc instead.
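
The difference between `loc` and `iloc` on the concatenated rows can be sketched as follows. This is only an illustration: `make_df` is a hypothetical helper that rebuilds dataframes shaped like the three sample files, so the cell runs without the CSV files.

```python
import pandas as pd

# Hypothetical helper: rebuilds a frame shaped like concat_1/2/3 (not read from disk)
def make_df(start):
    return pd.DataFrame(
        {col: [f"{col.lower()}{i}" for i in range(start, start + 4)] for col in "ABCD"}
    )

row_concat = pd.concat([make_df(0), make_df(4), make_df(8)])

# loc selects by index label: every source dataframe contributed a row labeled 0
label_rows = row_concat.loc[0]

# iloc selects by position: only the very first row sits at position 0
first_row = row_concat.iloc[0]

print(label_rows)
print(first_row)
```

Because the index labels repeat, `loc[0]` comes back as a dataframe with several rows, while `iloc[0]` comes back as a single Series.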


Can we append a series as a new row?

In [8]:
# create a Series to be a new row
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
new_row_series

0    n1
1    n2
2    n3
3    n4
dtype: object

In [14]:
# add the new row Series
pd.concat([df1,new_row_series])

Unnamed: 0,A,B,C,D,0
0,a0,b0,c0,d0,
1,a1,b1,c1,d1,
2,a2,b2,c2,d2,
3,a3,b3,c3,d3,
0,,,,,n1
1,,,,,n2
2,,,,,n3
3,,,,,n4


That doesn't look like what we were expecting. Why is that? 

What if we turn the series into a dataframe?

In [13]:
# create a one-row dataframe whose column names match df1
new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
                            columns=['A', 'B', 'C', 'D'])
new_row_df

Unnamed: 0,A,B,C,D
0,n1,n2,n3,n4


In [16]:
# concatenate the new row dataframe
pd.concat([df1,new_row_df])

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,n1,n2,n3,n4
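
One way to see why the Series version misbehaved: `concat` lines things up by label. The plain Series had index labels 0 through 3, none of which match the column names, so it landed in a new column named 0. The sketch below assumes we give the Series index labels that match the columns and then turn it into a one-row dataframe. The `df1` here is rebuilt inline as a stand-in for the sample file.

```python
import pandas as pd

# Stand-in for df1 (rebuilt inline rather than read from concat_1.csv)
df1 = pd.DataFrame(
    {col: [f"{col.lower()}{i}" for i in range(4)] for col in "ABCD"}
)

# A Series whose index labels match the column names of df1
new_row_series = pd.Series(["n1", "n2", "n3", "n4"], index=["A", "B", "C", "D"])

# to_frame() gives a one-column dataframe; .T transposes it into one row
new_row_df = new_row_series.to_frame().T

# ignore_index=True renumbers the rows 0..4
combined = pd.concat([df1, new_row_df], ignore_index=True)
print(combined)
```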


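The pandas `concat` function is a general function that can combine any number of objects. If you just need to combine two dataframes, older versions of pandas also offer the `append` method of one of the dataframes. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the next two cells only run on earlier versions.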

In [17]:
# combine df2 and the new row dataframe
df2.append(new_row_df)

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,n1,n2,n3,n4


In [18]:
# combine df2 and df3
df2.append(df3)

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11
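
If `append` is not available in your version of pandas, `concat` should give the same result. A minimal sketch, again using a hypothetical `make_df` helper in place of the CSV files:

```python
import pandas as pd

# Hypothetical helper: rebuilds a frame shaped like the sample files
def make_df(start):
    return pd.DataFrame(
        {col: [f"{col.lower()}{i}" for i in range(start, start + 4)] for col in "ABCD"}
    )

df2 = make_df(4)
df3 = make_df(8)

# Same idea as df2.append(df3): stack the rows of df3 under df2
combined = pd.concat([df2, df3])
print(combined)
```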


## Ignoring the index
Look at the row indices in the last example. Isn't it confusing to have repeating index values? If the index values don't really matter, we can reset them when the dataframes are combined, either by chaining `reset_index(drop=True)` onto the result or by passing the `ignore_index=True` option to `concat`.

In [23]:
# concatenate and reset the index
pd.concat([df1,df2,df3]).reset_index(drop=True)

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6
7,a7,b7,c7,d7
8,a8,b8,c8,d8
9,a9,b9,c9,d9


In [22]:
# same result, using the ignore_index option of concat
pd.concat([df1,df2,df3],ignore_index=True)

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6
7,a7,b7,c7,d7
8,a8,b8,c8,d8
9,a9,b9,c9,d9


## Adding columns
If we want to concatenate columns, we use a similar syntax. The difference is that we have to add the `axis` parameter. The default is `axis=0`, which concatenates rows. The value `axis=1` specifies columns.

In [24]:
all_df=[df1,df2,df3]

In [27]:
# concatenate columns
col_concat=pd.concat(all_df,axis=1)
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11


In [None]:
# get similar results as before when we subset on column A
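
A sketch of what that subset looks like, assuming dataframes rebuilt inline with the hypothetical `make_df` helper. Because the column name `A` repeats after a column-wise concatenation, selecting it returns every column with that name.

```python
import pandas as pd

# Hypothetical helper: rebuilds a frame shaped like the sample files
def make_df(start):
    return pd.DataFrame(
        {col: [f"{col.lower()}{i}" for i in range(start, start + 4)] for col in "ABCD"}
    )

col_concat = pd.concat([make_df(0), make_df(4), make_df(8)], axis=1)

# The label "A" appears three times, so this returns a dataframe, not a Series
a_cols = col_concat["A"]
print(a_cols)
```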


Using `concat` to add columns works as long as the new columns are in a dataframe. But remember that we can add a single column without using any special pandas functions.


In [32]:
# add a list as a single new column
col_concat['new_list']=['n1','n2','n3','n4']
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2,new_list
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8,n1
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9,n2
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10,n3
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11,n4


In [38]:
pd.Series(['n1','n2','n3','n4'])

0    n1
1    n2
2    n3
3    n4
dtype: object

In [39]:
# add a Series as a single new column
col_concat['new_series']=pd.Series(['n1','n2','n3','n4'])
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2,new_list,new_series
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8,n1,n1
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9,n2,n2
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10,n3,n3
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11,n4,n4


As we did with the row indices, we can also reset the column labels when we concatenate so that we don't get duplicates.

In [None]:
# use ignore_index with columns


# Concatenating with different indices
To this point we have been combining data where the column names or row indices were the same. What do we do if they don't match? We use the `join` parameter. This works like a SQL join.

## Different columns

In [40]:
# create dataframes with different column names
df1.columns=['A','B','C','D']
df2.columns=['E','F','G','H']
df3.columns=['A','H','F','C']

In [44]:
# concatenate the columns and replace the column names with 0, 1, 2, ... using ignore_index
col_concat_ignore=pd.concat(all_df,axis=1,ignore_index=True)
col_concat_ignore

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11


In [None]:
# inspect df1


In [None]:
# inspect df2


In [None]:
# inspect df3


In [45]:
# Concatenate the dataframes like we did before. What is different?
pd.concat(all_df)

Unnamed: 0,A,B,C,D,E,F,G,H
0,a0,b0,c0,d0,,,,
1,a1,b1,c1,d1,,,,
2,a2,b2,c2,d2,,,,
3,a3,b3,c3,d3,,,,
0,,,,,a4,b4,c4,d4
1,,,,,a5,b5,c5,d5
2,,,,,a6,b6,c6,d6
3,,,,,a7,b7,c7,d7
0,a8,,d8,,,c8,,b8
1,a9,,d9,,,c9,,b9


The columns get aligned and the missing values are filled with `NaN`. 

That last code cell used the default value for the `join` parameter. The default is `join='outer'`. An outer join keeps all the columns in the dataframes being combined in the join operation. To only keep the columns that the three dataframes all have in common, use `join='inner'`.


In [47]:
# use an inner join
pd.concat(all_df,join='inner')

0
1
2
3
0
1
2
3
0
1
2


What happened?

Let's try a different one.
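
Here is one way to check what happened. An inner join keeps only the column names shared by every dataframe, and these three have none in common. The sketch below assumes dataframes rebuilt inline (the `make_df` helper is hypothetical) and renamed the same way as in the cell above.

```python
import pandas as pd

# Hypothetical helper: rebuilds a frame shaped like the sample files
def make_df(start):
    return pd.DataFrame(
        {col: [f"{col.lower()}{i}" for i in range(start, start + 4)] for col in "ABCD"}
    )

df1, df2, df3 = make_df(0), make_df(4), make_df(8)
df2.columns = ["E", "F", "G", "H"]
df3.columns = ["A", "H", "F", "C"]

# Column names shared by all three dataframes: none
common_all = set(df1.columns) & set(df2.columns) & set(df3.columns)
print(common_all)

# So the inner join of all three keeps the rows but no columns
inner_all = pd.concat([df1, df2, df3], join="inner")
print(inner_all.shape)

# df1 and df3 do share A and C, so those two columns survive
inner_13 = pd.concat([df1, df3], join="inner")
print(inner_13)
```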

## Different rows

The previous versions of df1, df2, and df3 all had the same row indices. Let's confirm that before we move on.

In [48]:
# show the index values for df1, df2, and df3
print(df1.index)
print(df2.index)
print(df3.index)

RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=4, step=1)


In [None]:
# show info() for df1


In [49]:
# Modify the dataframes to give them different row index values.
df1.index = [0, 1, 2, 3]
df2.index = [4, 5, 6, 7]
df3.index = [0, 2, 5, 7]

print(df1, '\n')
print(df2, '\n')
print(df3)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3 

    E   F   G   H
4  a4  b4  c4  d4
5  a5  b5  c5  d5
6  a6  b6  c6  d6
7  a7  b7  c7  d7 

     A    H    F    C
0   a8   b8   c8   d8
2   a9   b9   c9   d9
5  a10  b10  c10  d10
7  a11  b11  c11  d11


When we concatenate on columns (`axis=1`) we get the same kind of results as we did before. The default outer join keeps all the rows and fills missing values with `NaN`. Use the `join='inner'` option to only keep the rows common to all concatenated dataframes.

In [50]:
# Default results 
col_concat = pd.concat([df1, df2, df3], axis=1)
col_concat

Unnamed: 0,A,B,C,D,E,F,G,H,A.1,H.1,F.1,C.1
0,a0,b0,c0,d0,,,,,a8,b8,c8,d8
1,a1,b1,c1,d1,,,,,,,,
2,a2,b2,c2,d2,,,,,a9,b9,c9,d9
3,a3,b3,c3,d3,,,,,,,,
4,,,,,a4,b4,c4,d4,,,,
5,,,,,a5,b5,c5,d5,a10,b10,c10,d10
6,,,,,a6,b6,c6,d6,,,,
7,,,,,a7,b7,c7,d7,a11,b11,c11,d11


In [51]:
# use join='inner' to only keep common rows
pd.concat([df1,df3],axis=1,join="inner")

Unnamed: 0,A,B,C,D,A.1,H,F,C.1
0,a0,b0,c0,d0,a8,b8,c8,d8
2,a2,b2,c2,d2,a9,b9,c9,d9


# Merging multiple datasets
Instead of lining data up by a row or column index, you may have two or more dataframes that need to be combined based on values they share. This can be the case when data is split into different pieces, such as stock data stored by year or survey data stored by location. These datasets can contain values that are common to them and can therefore be used to combine them in a way that keeps them aligned. In the database world, we call this a "join" between tables.

In our example, we'll use the `merge` function in pandas to combine data saved as smaller parts of a larger set. Each of the tables is one observational unit, qualifying them as tidy data. Remember that being "tidy" is a good thing for storing data, but not necessarily for analyzing data.

First, let's load and inspect four new CSV files.

In [None]:
# load the desired data
person = pd.read_csv('../data/survey_person.csv')
site = pd.read_csv('../data/survey_site.csv')
survey = pd.read_csv('../data/survey_survey.csv')
visited = pd.read_csv('../data/survey_visited.csv')

In [None]:
# inspect the person data


In [None]:
# inspect the site data


In [None]:
# inspect the survey data


In [None]:
# inspect the visited data


You can see that this data is tidy, in part because each observational unit is represented in its own table. If we want to look at the dates a site was visited along with its longitude and latitude, we need to use a pandas function called `merge` to combine these dataframes for analysis.

A merge involves two dataframes. The dataframe whose `merge` method we call is referred to as the _left_ one, and the dataframe passed inside the parentheses is the _right_ one. The `how` parameter specifies how the merge should be conducted. This table (like Table 4.1 from the text, p. 104) shows the `how` values and results, plus the equivalent SQL command.

## TABLE 4.1 - pandas' `how` and SQL

|<p style="text-align:left;">pandas</p>|<p style="text-align:left;">SQL</p>|<p style="text-align:left;">Description</p>|
| --- | --- | --- |
|<p style="text-align:left;font-family:Courier New">left</p>|<p style="text-align:left;">left outer</p> |<p style="text-align:left;">Keep all the keys from the left.</p>|
|<p style="text-align:left;font-family:Courier New">right</p>|<p style="text-align:left;">right outer</p> |<p style="text-align:left;">Keep all the keys from the right.</p>|
|<p style="text-align:left;font-family:Courier New">outer</p>|<p style="text-align:left;">full outer</p> |<p style="text-align:left;">Keep all the keys from both left and right.</p>|
|<p style="text-align:left;font-family:Courier New">inner</p>|<p style="text-align:left;">inner</p> |<p style="text-align:left;">Keep only the keys that exist in both  left and right.</p>|

Another important parameter is `on`. This specifies which columns are common to the dataframes, allowing them to be lined up. If the column names are different in the two dataframes, use `left_on` and `right_on` to specify them.
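
To make the four `how` values concrete, here is a small sketch. The two dataframes and their column names (`key`, `left_val`, `right_val`) are made up for illustration and are not part of the survey data.

```python
import pandas as pd

# Made-up example tables that share a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Collect which keys survive under each kind of merge
results = {}
for how in ["left", "right", "outer", "inner"]:
    merged = left.merge(right, on="key", how=how)
    results[how] = sorted(merged["key"])
    print(how, results[how])
```

Keys that exist on only one side still appear in the left, right, and outer results; the columns from the other dataframe are filled with `NaN` in those rows.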

We can perform a one-to-one, a many-to-one, or a many-to-many merge. Each is a little different.

## One-to-one merge
 - Join two dataframes, joining one column to another (i.e., keys)
 - There are no duplicates in the joining columns.

In [None]:
# subset one dataframe to use in this example
visited_sub = visited.loc[[0, 2, 6]]
visited_sub

In [None]:
# inspect the site data


In [None]:
# "name" and "site" columns are the same thing just with different names
# default for 'how' option is 'inner' so we can skip it
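
A one-to-one merge along these lines might look like the sketch below. The tables are small hypothetical stand-ins, not the contents of the survey files: only the `name` and `site` column names come from the comments above, and the other columns and all the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the site table (values are invented)
site = pd.DataFrame(
    {"name": ["S1", "S2", "S3"], "lat": [10.0, 20.0, 30.0], "long": [-1.0, -2.0, -3.0]}
)

# Hypothetical stand-in for visited_sub: each site appears exactly once
visited_sub = pd.DataFrame(
    {"ident": [1, 2, 3], "site": ["S1", "S2", "S3"],
     "dated": ["2020-01-01", "2020-02-01", "2020-03-01"]}
)

# The key columns have different names, so we use left_on and right_on
o2o_merge = site.merge(visited_sub, left_on="name", right_on="site")
print(o2o_merge)
```

Both key columns are kept in the result because they have different names.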


## Many-to-one merge
- One of the dataframes has key values that repeat.
- Any observations on the single key side will be duplicated in the results.

We'll do this example with the entire `visited` dataframe.

In [None]:
# Inspect the visited data as a reminder


In [None]:
# Inspect the site data as a reminder


In [None]:
# perform the many-to-one merge
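
A sketch of a many-to-one merge, again with small hypothetical stand-in tables rather than the real survey files (only the `name` and `site` column names come from the notebook; everything else is invented):

```python
import pandas as pd

# Hypothetical stand-in for site: one row per site
site = pd.DataFrame({"name": ["S1", "S2"], "lat": [10.0, 20.0], "long": [-1.0, -2.0]})

# Hypothetical stand-in for visited: the site key repeats
visited = pd.DataFrame(
    {"ident": [1, 2, 3, 4], "site": ["S1", "S1", "S2", "S2"],
     "dated": ["2020-01-01", "2020-01-15", "2020-02-01", "2020-02-15"]}
)

# Each site row is repeated once for every matching visit
m2o_merge = site.merge(visited, left_on="name", right_on="site")
print(m2o_merge)
```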


See how the `site` data was duplicated in the matching `visited` rows?

## Many-to-many merge

Sometimes we'll need to make a match based upon multiple columns. We'll create a couple of new dataframes for this example.

In [None]:
# a reminder of what person looks like



In [None]:
# a reminder of what survey looks like


In [None]:
# a reminder of what visited looks like


In [None]:
# create sample tables by merging person with survey and visited with site


In [None]:
# what do these tables look like?


In [None]:
# perform the many-to-many merge
# the multiple columns for the merge are passed in a list


In [None]:
# Look at the first row


pandas automatically adds a suffix to column names in the result if there are any "collisions" in the merge. The colliding columns from the left dataframe get a "\_x" suffix and those from the right dataframe get a "\_y" suffix.
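
The sketch below shows a merge on multiple columns and the suffixes it produces. The two dataframes and their column names (`k1`, `k2`, `value`) are made up for illustration and are not the merged survey tables. The `suffixes` parameter of `merge` lets us pick our own labels instead of the defaults.

```python
import pandas as pd

# Made-up tables: both have key columns k1 and k2, plus a colliding "value" column
left = pd.DataFrame({"k1": [1, 1, 2], "k2": ["x", "y", "x"], "value": [10, 20, 30]})
right = pd.DataFrame({"k1": [1, 1, 2], "k2": ["x", "x", "y"], "value": [100, 200, 300]})

# The columns to match on are passed in a list; rows must agree on both
merged = left.merge(right, on=["k1", "k2"])
print(merged)

# Same merge with custom suffixes instead of the default _x and _y
merged_named = left.merge(right, on=["k1", "k2"], suffixes=("_left", "_right"))
print(merged_named.columns.tolist())
```

Only the (1, "x") combination appears in both tables, and it appears twice on the right, so the result has two rows.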

We may have the data combined in a form we need for our analysis, but there is considerable redundancy in the `ps_vs` dataframe. Soon we will talk about how to clean up that redundancy and other issues.