# `pandas` Part 7: Combining Datasets with `concat()`

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Combine DataFrames and/or Series with `concat()`
2. Understand a multi-index
3. Reset an index with `reset_index()`
4. Perform descriptive analytics on a combined DataFrame

## Files Needed for this lesson:
>- `CAvideos.csv`
>- `GBvideos.csv`
>- Download these csv files from Canvas prior to the lesson
>- C:\\Users\\mimc2537\\OneDrive - UCB-O365\\python\\pandas\\

## The general steps to working with pandas:
1. `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Reading data with `pd.read_`
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5-10 records in your DataFrame

# Introduction Notes on Combining Data Using `pandas`
1. Being able to combine data from multiple sources is a critical skill for analytics professionals
2. We will learn the `pandas` way of combining data but there are similarities here to SQL
3. Why combine data with `pandas` if you can do the same thing in SQL?
>- The answer to this depends on the project
>- Some projects may be completed more efficiently all with `pandas` so you wouldn't necessarily need SQL
>- For some projects incorporating SQL into our python code makes sense
>- In an analytics job, you will likely use both python and SQL to get the job done!

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

In [1]:
import os
import pandas as pd

os.getcwd()
for file in os.listdir():
    if '.csv' in file:
        print(file)
    elif '.xlsx' in file:
        print(file)
        
file1 = "CAvideos.csv"
file2 = "GBvideos.csv"


CAvideos.csv
GBvideos.csv
winemag-data-130k-v2 (2).csv


# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file names: 
>>- `CAvideos.csv`
>>- `GBvideos.csv`

In [2]:
ca = pd.read_csv(file1)
gb = pd.read_csv(file2)

### Check a couple of rows of data in each of the new DataFrames

In [3]:
ca.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


In [4]:
gb.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,Jw1Y-zhQURU,17.14.11,John Lewis Christmas Ad 2017 - #MozTheMonster,John Lewis,26,2017-11-10T07:38:29.000Z,"christmas|""john lewis christmas""|""john lewis""|...",7224515,55681,10247,9479,https://i.ytimg.com/vi/Jw1Y-zhQURU/default.jpg,False,False,False,Click here to continue the story and make your...
1,3s1rvMFUweQ,17.14.11,Taylor Swift: …Ready for It? (Live) - SNL,Saturday Night Live,24,2017-11-12T06:24:44.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""Epi...",1053632,25561,2294,2757,https://i.ytimg.com/vi/3s1rvMFUweQ/default.jpg,False,False,False,Musical guest Taylor Swift performs …Ready for...
2,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787420,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...


### Check how many rows and columns are in our DataFrames

In [5]:
print(ca.shape, gb.shape)

(40881, 16) (38916, 16)


## Check the datatypes

In [6]:
gb.dtypes

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

# Combining DataFrames
>- The three common ways to combine datasets in pandas are `concat()`, `join()`, and `merge()`
>- `concat()` will take two or more DataFrames or Series and append them together
>>- This is basically taking DataFrames and stacking their data on top of each other into one DataFrame
>>- For a clean `concat()` you want the columns/fields in both DataFrames to be the same; a column that exists in only one DataFrame is filled with `NaN` for the rows that come from the other
>- `join()` "links" DataFrames together based on a common field/column between the two
>- `merge()` also links DataFrames together based on common field/columns but with different syntax.
>>- We will cover the most basic join in this class
>>- A more in depth study of joins is provided in SQL focused courses
>>- Pandas join reference for further study: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
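
To make the difference between the three functions concrete, here is a small sketch on made-up data. The DataFrames `left`, `right`, and `channels` are hypothetical and are not part of the YouTube files.

```python
import pandas as pd

# Hypothetical toy data (not the YouTube files)
left = pd.DataFrame({'video_id': ['a1', 'b2'], 'views': [100, 250]})
right = pd.DataFrame({'video_id': ['c3', 'd4'], 'views': [75, 900]})
channels = pd.DataFrame({'video_id': ['a1', 'b2'], 'channel': ['X', 'Y']})

# concat(): stack rows from DataFrames that share the same columns
stacked = pd.concat([left, right])

# merge(): link two DataFrames on a common column
merged = left.merge(channels, on='video_id')

# join(): link on the index, so move the common column into the index first
joined = left.set_index('video_id').join(channels.set_index('video_id'))

print(stacked.shape, merged.shape, joined.shape)
```

`concat()` adds rows, while `merge()` and `join()` add columns from a related table.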


# Using the YouTube DataFrames to practice combining data with pandas
>- The YouTube datasets store data on various YouTube trending statistics
>- Our example datasets show several months of data and daily trending YouTube videos.

>- For more information and for other YouTube datasets see the following link:
>>- https://www.kaggle.com/datasnaek/youtube-new

### First, creating a new DataFrame that appends the Canadian and British YouTube DataFrames

In [7]:
CanUK = pd.concat([ca, gb], keys=['CA','GB'])
print(CanUK.shape, ca.shape, gb.shape)

(79797, 16) (40881, 16) (38916, 16)


#### Some notes on the previous code
>- Line 1: We define a new DataFrame named `CanUK` as the concatenation, `concat()`, of two datasets
>>- Dataset 1 = `ca` (Canadian YouTube data)
>>- Dataset 2 = `gb` (British YouTube data)
>>- The `concat()` function takes the two (or more if applicable) DataFrames, passed as a list, and "stacks" them on top of each other
>- In the same line we use the `keys` option to define a multi-index (aka hierarchical index)
>>- Because our datasets represent YouTube videos from different countries we pass the abbreviated names of those countries as a list to `keys`
>>- Enter the key names in the same order as the DataFrames in the list (e.g., 'CA' first, 'GB' second)
>- `concat()` also accepts a `names` option to label the index levels created by `keys`
>>- We did not pass `names` here, so there are no labels above our index columns
>- Line 2: We print the shape of the combined DataFrame next to the shapes of the two originals
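
A minimal sketch of `keys` together with the optional `names` argument, using made-up DataFrames. The names `ca_demo`, `gb_demo`, and the level labels `Country` and `row_id` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy DataFrames standing in for `ca` and `gb`
ca_demo = pd.DataFrame({'video_id': ['a1', 'b2'], 'views': [100, 250]})
gb_demo = pd.DataFrame({'video_id': ['c3', 'd4', 'e5'], 'views': [75, 900, 40]})

# `keys` labels the outer index level; `names` labels both index levels
combined = pd.concat(
    [ca_demo, gb_demo],
    keys=['CA', 'GB'],
    names=['Country', 'row_id'],
)

print(combined.index.names)  # labels of the two index levels
print(combined.loc['GB'])    # every row that came from gb_demo
```

Selecting with `combined.loc['GB']` is one practical benefit of the multi-index: the outer level remembers which source each row came from.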

### Check the index for any dataframe using `DataFrame.index`
>- Note how `concat()` keeps the row ids from each country's dataset instead of continuing the count

In [8]:
CanUK.index

MultiIndex([('CA',     0),
            ('CA',     1),
            ('CA',     2),
            ('CA',     3),
            ('CA',     4),
            ('CA',     5),
            ('CA',     6),
            ('CA',     7),
            ('CA',     8),
            ('CA',     9),
            ...
            ('GB', 38906),
            ('GB', 38907),
            ('GB', 38908),
            ('GB', 38909),
            ('GB', 38910),
            ('GB', 38911),
            ('GB', 38912),
            ('GB', 38913),
            ('GB', 38914),
            ('GB', 38915)],
           length=79797)
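
If you would rather have one continuous row count than repeated row ids, `concat()` has an `ignore_index` option. The sketch below uses made-up DataFrames (`ca_demo`, `gb_demo`) and no `keys`, so it shows a plain index rather than a multi-index.

```python
import pandas as pd

# Hypothetical toy DataFrames
ca_demo = pd.DataFrame({'views': [100, 250]})
gb_demo = pd.DataFrame({'views': [75, 900, 40]})

# Default: each DataFrame keeps its own row ids, so labels repeat
kept = pd.concat([ca_demo, gb_demo])
print(kept.index.tolist())        # [0, 1, 0, 1, 2]

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([ca_demo, gb_demo], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```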

### Take a look at our new DataFrame

In [9]:
CanUK.head(3)

Unnamed: 0,Unnamed: 1,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
CA,0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
CA,1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
CA,2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


#### Did using `concat()` work to append the two DataFrames together? 
>- Check the shape of your new DataFrame
>- Compare the number of records to each one individually
>>- `ca` = 40881 records
>>- `gb` = 38916 records
>>- 40881 + 38916 = 79797 total records
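
The same check can be written in code rather than done by hand. This is a sketch on made-up DataFrames; `ca_demo`, `gb_demo`, and `per_source` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy DataFrames standing in for `ca` and `gb`
ca_demo = pd.DataFrame({'views': [100, 250]})
gb_demo = pd.DataFrame({'views': [75, 900, 40]})
combined = pd.concat([ca_demo, gb_demo], keys=['CA', 'GB'])

# Row counts should add up, and the column count should be unchanged
assert len(combined) == len(ca_demo) + len(gb_demo)
assert combined.shape[1] == ca_demo.shape[1]

# Rows contributed by each source, read from the outer index level
per_source = combined.index.get_level_values(0).value_counts()
print(per_source)
```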

#### `reset_index`:
##### Note: You can reset an index with `reset_index()`
>- This can be useful for some situations
>- For a multi-index you can pass the `level` option and specify which index level you want to reset
>- Note: To make the change to our current DataFrame we would need to use the option, `inplace=True`

In [10]:
CanUK.reset_index(level = 0, inplace = True)

CanUK.head(3)

Unnamed: 0,level_0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,CA,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,CA,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,CA,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


In [11]:
CanUK = CanUK.rename(columns = {'level_0':'Country'})
CanUK

Unnamed: 0,Country,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,CA,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,CA,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,CA,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,CA,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
4,CA,2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38911,GB,l884wKofd54,18.14.06,Enrique Iglesias - MOVE TO MIAMI (Official Vid...,EnriqueIglesiasVEVO,10,2018-05-09T07:00:01.000Z,"Enrique Iglesias feat. Pitbull|""MOVE TO MIAMI""...",25066952,268088,12783,9933,https://i.ytimg.com/vi/l884wKofd54/default.jpg,False,False,False,NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ...
38912,GB,IP8k2xkhOdI,18.14.06,Jacob Sartorius - Up With It (Official Music V...,Jacob Sartorius,10,2018-05-11T17:09:16.000Z,"jacob sartorius|""jacob""|""up with it""|""jacob sa...",1492219,61998,13781,24330,https://i.ytimg.com/vi/IP8k2xkhOdI/default.jpg,False,False,False,THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new...
38913,GB,Il-an3K9pjg,18.14.06,Anne-Marie - 2002 [Official Video],Anne-Marie,10,2018-05-08T11:05:08.000Z,"anne|""marie""|""anne-marie""|""2002""|""two thousand...",29641412,394830,8892,19988,https://i.ytimg.com/vi/Il-an3K9pjg/default.jpg,False,False,False,Get 2002 by Anne-Marie HERE ▶ http://ad.gt/200...
38914,GB,-DRsfNObKIQ,18.14.06,Eleni Foureira - Fuego - Cyprus - LIVE - First...,Eurovision Song Contest,24,2018-05-08T20:32:32.000Z,"Eurovision Song Contest|""2018""|""Lisbon""|""Cypru...",14317515,151870,45875,26766,https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg,False,False,False,Eleni Foureira represented Cyprus at the first...
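
The reset-then-rename steps can also be written without `inplace=True` by assigning the result to a new variable. The sketch below uses made-up DataFrames and also shows one possible alternative, which is an assumption about style rather than the lesson's method: naming the index levels in `concat()` so that no rename is needed.

```python
import pandas as pd

# Hypothetical toy DataFrames standing in for `ca` and `gb`
ca_demo = pd.DataFrame({'views': [100, 250]})
gb_demo = pd.DataFrame({'views': [75, 900, 40]})
combined = pd.concat([ca_demo, gb_demo], keys=['CA', 'GB'])

# Move only the outer level (level 0) out of the index; the new column is 'level_0'
step1 = combined.reset_index(level=0)
# Rename it, as in the lesson
step2 = step1.rename(columns={'level_0': 'Country'})

# Alternative: label the levels up front, then reset by level name
named = pd.concat([ca_demo, gb_demo], keys=['CA', 'GB'], names=['Country', 'row_id'])
alt = named.reset_index(level='Country')

print(step2.columns.tolist())
print(alt.columns.tolist())
```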


# Now some descriptive analytics

### What channels have the most trending videos?

In [13]:
CanUK.groupby('channel_title').channel_title.count().sort_values(ascending = False)

channel_title
The Late Show with Stephen Colbert        361
TheEllenShow                              357
Jimmy Kimmel Live                         351
The Tonight Show Starring Jimmy Fallon    345
Breakfast Club Power 105.1 FM             342
                                         ... 
Ghost Adventures Season 16                  1
Global Savage                               1
GloomyHouse                                 1
GoCanucksGo                                 1
활력소TV                                       1
Name: channel_title, Length: 5956, dtype: int64
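
For a simple count per category, `value_counts()` gives the same numbers as the `groupby()` chain above and sorts from most to least frequent by default. A small sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'channel_title': ['A', 'B', 'A', 'C', 'A', 'B']})

# groupby + count + sort, as in the lesson
by_groupby = (
    demo.groupby('channel_title')
    .channel_title.count()
    .sort_values(ascending=False)
)

# value_counts() does the counting and the descending sort in one call
by_value_counts = demo['channel_title'].value_counts()

print(by_groupby.to_dict())
print(by_value_counts.to_dict())
```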

### Which channels have the most total views?

In [14]:
CanUK.groupby('channel_title').views.sum().sort_values(ascending = False)

channel_title
NickyJamTV                    8758841250
Ozuna                         8447128427
Bad Bunny                     6981462307
DrakeVEVO                     6772018167
ChildishGambinoVEVO           6513084682
                                 ...    
Boston Celtics on MassLive          1637
Qc TV HD                            1569
Mathieu Désy                        1464
Georgia Webster                     1187
udearroba                           1141
Name: views, Length: 5956, dtype: int64

### What are the quantitative descriptive statistics for TheEllenShow?

In [17]:
# Filter to TheEllenShow rows, then describe the numeric columns

CanUK[CanUK['channel_title'] == 'TheEllenShow'].describe()

Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,357.0,357.0,357.0,357.0,357.0
mean,24.0,1779820.0,46257.770308,1328.537815,2179.221289
std,0.0,2008212.0,62550.222512,3198.710961,3614.375611
min,24.0,56313.0,1187.0,18.0,0.0
25%,24.0,536052.0,14152.0,243.0,318.0
50%,24.0,1238747.0,28108.0,598.0,940.0
75%,24.0,2022094.0,59950.0,1309.0,2858.0
max,24.0,13592650.0,499673.0,26397.0,28381.0


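##### Alternatively, you can use `loc[]` or `where()` to perform the filtering operation
>- The use of `where()` or `loc[]` depends on the question/purpose or sometimes just personal preference
>- `loc[]` returns only the matching rows; `where()` keeps every row and replaces the non-matching ones with `NaN`

In [18]:
CanUK.loc[CanUK['channel_title'] == 'TheEllenShow'].describe()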

Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,357.0,357.0,357.0,357.0,357.0
mean,24.0,1779820.0,46257.770308,1328.537815,2179.221289
std,0.0,2008212.0,62550.222512,3198.710961,3614.375611
min,24.0,56313.0,1187.0,18.0,0.0
25%,24.0,536052.0,14152.0,243.0,318.0
50%,24.0,1238747.0,28108.0,598.0,940.0
75%,24.0,2022094.0,59950.0,1309.0,2858.0
max,24.0,13592650.0,499673.0,26397.0,28381.0


In [20]:
CanUK.where(CanUK['channel_title'] == 'TheEllenShow').describe()

Unnamed: 0,category_id,views,likes,dislikes,comment_count,comments_disabled,ratings_disabled,video_error_or_removed
count,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0
mean,24.0,1779820.0,46257.770308,1328.537815,2179.221289,0.170868,0.0,0.0
std,0.0,2008212.0,62550.222512,3198.710961,3614.375611,0.376922,0.0,0.0
min,24.0,56313.0,1187.0,18.0,0.0,0.0,0.0,0.0
25%,24.0,536052.0,14152.0,243.0,318.0,0.0,0.0,0.0
50%,24.0,1238747.0,28108.0,598.0,940.0,0.0,0.0,0.0
75%,24.0,2022094.0,59950.0,1309.0,2858.0,0.0,0.0,0.0
max,24.0,13592650.0,499673.0,26397.0,28381.0,1.0,0.0,0.0
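
The three filtering styles can be compared side by side on a tiny made-up DataFrame. The names `demo` and `mask` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'channel_title': ['TheEllenShow', 'Other', 'TheEllenShow'],
    'views': [100, 50, 300],
})
mask = demo['channel_title'] == 'TheEllenShow'

by_brackets = demo[mask]     # keeps only the matching rows
by_loc = demo.loc[mask]      # same rows, using loc[]
by_where = demo.where(mask)  # keeps all rows; non-matching rows become NaN

print(len(by_brackets), len(by_loc), len(by_where))
print(by_where)
```

Because aggregations such as `sum()` and `describe()` skip `NaN` by default, the numeric summaries for the original columns agree across the three approaches even though `where()` returns more rows.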


### What were the total YouTube videos, total views, likes and dislikes for TheEllenShow?
>- Using the `agg()` function to calculate specific aggregations on different columns

In [24]:
CanUK[CanUK['channel_title'] == 'TheEllenShow']\
.agg({'video_id':['count'],'views':['sum'],'likes':['sum','mean'],'dislikes':['sum']})

Unnamed: 0,video_id,views,likes,dislikes
count,357.0,,,
sum,,635395636.0,16514020.0,474288.0
mean,,,46257.77,
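
The `NaN` cells in the output above appear because each aggregation was requested for only some columns. A sketch on made-up data shows this, along with a flatter form of `agg()` that takes a single function name per column (`demo`, `wide`, and `flat` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'video_id': ['a1', 'b2', 'c3'],
    'views': [100, 50, 300],
    'likes': [10, 5, 30],
})

# Lists of functions: one row per aggregation, NaN where it was not requested
wide = demo.agg({'video_id': ['count'], 'views': ['sum'], 'likes': ['sum', 'mean']})
print(wide)

# One function name per column: the result is a flat Series
flat = demo.agg({'video_id': 'count', 'views': 'sum', 'likes': 'sum'})
print(flat)
```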


## What are the totals for TheEllenShow's top 5 most viewed videos?
>- Only include the title names as part of the output (not channel or any other categorical fields)
>- Include total views, likes, dislikes, and comment count in the output


In [29]:
# Filter to TheEllenShow, total the numeric columns per title, keep the top 5 by views
CanUK[CanUK.channel_title == 'TheEllenShow']\
.groupby('title')\
[['views','likes','dislikes','comment_count']]\
.sum()\
.sort_values(by = 'views',ascending = False)\
.head(5)

Unnamed: 0_level_0,views,likes,dislikes,comment_count
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Jennifer Lawrence Explains Her Drunk Alter Ego 'Gail',71621685,1302161,35255,49762
Kid Yodeler Mason Ramsey Performs,68553312,2697959,144604,0
Kim Kardashian Lets Gender of Third Child Slip,52985123,776177,26509,50940
Billionaire Bill Gates Guesses Grocery Store Prices,37076353,542804,29036,42720
Laurel or Yanny?,28695358,613487,18568,100731


# Some Notes on the Previous Example
>- Our pandas code in the previous example is similar to SQL in the following ways
    1. `CanUK[CanUK.channel_title == 'TheEllenShow']` is SQL equivalent to `WHERE channel_title = 'TheEllenShow'`
    2. The column list selected after `groupby()`, combined with `sum()`, is SQL equivalent to:
        `SELECT title, sum(views), sum(likes), sum(dislikes), sum(comment_count)`
    3. `groupby('title')` is SQL equivalent to `GROUP BY title`
    4. `sort_values(by = 'views', ascending = False)` limited to the first 5 rows is SQL equivalent to `ORDER BY sum(views) DESC` with a row limit (`LIMIT 5` in many SQL dialects)
    5. Now in pandas we enter the aggregation after the `groupby()`, in this example `sum()`
      >>- In SQL we write the aggregation in the SELECT statement
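
The comparison above can be tried directly by running the pandas version and an SQL version against the same made-up data. This sketch assumes Python's built-in `sqlite3` module as the SQL engine; the table name `videos`, the DataFrame `demo`, and the column aliases are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'channel_title': ['TheEllenShow', 'TheEllenShow', 'TheEllenShow', 'Other'],
    'title': ['Clip A', 'Clip A', 'Clip B', 'Clip C'],
    'views': [100, 150, 300, 999],
    'likes': [10, 15, 30, 99],
})

# pandas: filter, group, aggregate, sort, limit
pandas_result = (
    demo[demo.channel_title == 'TheEllenShow']
    .groupby('title')[['views', 'likes']]
    .sum()
    .sort_values(by='views', ascending=False)
    .head(5)
    .reset_index()
)

# SQL: the same steps against an in-memory SQLite copy of the data
conn = sqlite3.connect(':memory:')
demo.to_sql('videos', conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT title, SUM(views) AS total_views, SUM(likes) AS total_likes
    FROM videos
    WHERE channel_title = 'TheEllenShow'
    GROUP BY title
    ORDER BY total_views DESC
    LIMIT 5
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Both results list the same titles in the same order; only the column labels differ because the SQL query names its aggregates explicitly.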
      
## In future lessons we will continue to learn how pandas and SQL relate