# Solutions

1. [Automatic Index Alignment](#1.-Automatic-Index-Alignment)
1. [Concatenating Data](#2.-Concatenating-Data)
1. [Joining DataFrames](#3.-Joining-DataFrames)

## 1. Automatic Index Alignment

### Exercise 1

<span style="color:green; font-size:16px">Create two Series of integers with no missing values. Make one with four values and the other with five. When added together, the result should be a Series with 10 values, 2 of which are missing.</span>

In [1]:
import pandas as pd
import numpy as np
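# Note: s1 has six values here, not five; no 5-value and 4-value pair can give exactly 10 values with 2 missing
s1 = pd.Series(index=['a', 'a', 'a', 'b', 'b', 'd'], data=[1, 2, 3, 4, 5, 6])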
s2 = pd.Series(index=['a', 'a', 'b', 'c'], data=[1, 2, 3, 4])
s1 + s2

a    2.0
a    3.0
a    3.0
a    4.0
a    4.0
a    5.0
b    7.0
b    8.0
c    NaN
d    NaN
dtype: float64

In [2]:
len(s1 + s2)

10

### Exercise 2

<span style="color:green; font-size:16px">Create two Series of integers, each with three values, but with a non-identical index. When added together, the result should be a Series with three values.</span>

In [3]:
s1 = pd.Series(index=['a', 'b', 'c'], data=[1, 2, 3])
s2 = pd.Series(index=['c', 'b', 'a'], data=[4, 8, 3])
s1 + s2

a     4
b    10
c     7
dtype: int64

### Exercise 3

<span style="color:green; font-size:16px">Add two Series of integers together so that the result is a new Series with all missing values.</span>

In [4]:
s1 = pd.Series(index=['a', 'b', 'c'], data=[1, 2, 3])
s2 = pd.Series(index=['d', 'e', 'f'], data=[4, 8, 3])
s1 + s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64

### Exercise 4

<span style="color:green; font-size:16px">You add two Series together, one with four values, and the other with five. Each index label is the same. For instance, all labels for all Series could be `'a'`. How many total values would be in the resulting Series? Answer the question without pandas, and then check your work with it.</span>

In [5]:
# 20 values - cartesian product 5 * 4
s1 = pd.Series(index=['a', 'a', 'a', 'a', 'a'], data=[1, 2, 3, 4, 5])
s2 = pd.Series(index=['a', 'a', 'a', 'a'], data=[4, 8, 3, 4])
s1 + s2

a     5
a     9
a     4
a     5
a     6
a    10
a     5
a     6
a     7
a    11
a     6
a     7
a     8
a    12
a     7
a     8
a     9
a    13
a     8
a     9
dtype: int64

In [6]:
len(s1 + s2)

20
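The cartesian-product rule behind Exercises 1 and 4 can be sketched without pandas. The helper below, `aligned_length`, is a hypothetical name introduced here for illustration, not a pandas function. Each label shared by both indexes contributes the product of its counts, and each label found in only one index contributes its own count (as missing values). It assumes the two indexes are not identical; when they are equal, pandas adds the values position by position without aligning.

```python
from collections import Counter

import pandas as pd


def aligned_length(labels1, labels2):
    """Predict len(s1 + s2) from the index labels alone (illustrative helper)."""
    c1, c2 = Counter(labels1), Counter(labels2)
    total = 0
    for label in set(c1) | set(c2):
        if label in c1 and label in c2:
            # shared label: every pairing of the two sides appears once
            total += c1[label] * c2[label]
        else:
            # label on one side only: kept as-is, with missing values
            total += c1[label] + c2[label]
    return total


s1 = pd.Series(index=list('aaaaa'), data=[1, 2, 3, 4, 5])
s2 = pd.Series(index=list('aaaa'), data=[4, 8, 3, 4])
predicted = aligned_length(s1.index, s2.index)
actual = len(s1 + s2)
print(predicted, actual)
```

The same helper applied to the labels from Exercise 1 (`'aaabbd'` and `'aabc'`) gives 6 + 2 + 1 + 1.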

### Exercise 5

<span style="color:green; font-size:16px">Can you determine the shape of the resulting addition between the following two DataFrames before completing the operation?</span>

In [7]:
df1 = pd.DataFrame(data=np.random.randint(1, 6, (4, 8)), 
                   index=['a', 'b', 'b', 'd'], 
                   columns=['a', 'b', 'c', 'd', 
                            'e', 'f', 'g', 'h'])
df2 = pd.DataFrame(data=np.random.randint(1, 6, (6, 5)),
                   index=['a', 'b', 'b', 'b', 'b', 'c'],
                   columns=['a', 'b', 'c', 'd', 'e'])
display(df1, df2)

Unnamed: 0,a,b,c,d,e,f,g,h
a,5,3,5,4,5,5,4,5
b,2,4,2,1,1,4,4,2
b,1,4,5,1,3,1,4,1
d,1,3,2,5,2,2,4,4


Unnamed: 0,a,b,c,d,e
a,3,5,5,1,4
b,4,3,4,4,1
b,5,4,3,4,3
b,1,5,5,1,2
b,1,4,2,5,5
c,2,5,2,2,4


In [8]:
# 11 rows by 8 columns: rows a (1*1) + b (2*4) + c (1) + d (1) = 11; the union of the column labels has 8
(df1 + df2).shape

(11, 8)
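One way to reach the shape before running the addition is to count the row labels on each side. The sketch below rebuilds DataFrames with the same labels as `df1` and `df2` (filled with ones, since only the labels matter) and assumes the column labels are unique, as they are here. The names `n1`, `n2`, and `shared` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((4, 8), dtype=int),
                   index=list('abbd'), columns=list('abcdefgh'))
df2 = pd.DataFrame(np.ones((6, 5), dtype=int),
                   index=list('abbbbc'), columns=list('abcde'))

# Count how often each row label occurs in each DataFrame
n1 = df1.index.value_counts()
n2 = df2.index.value_counts()
shared = n1.index.intersection(n2.index)

# Shared labels multiply; labels in only one DataFrame are kept as-is
n_rows = int((n1[shared] * n2[shared]).sum()
             + n1.drop(shared).sum() + n2.drop(shared).sum())
# Unique column labels simply combine into their union
n_cols = len(df1.columns.union(df2.columns))
print((n_rows, n_cols), (df1 + df2).shape)
```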

### Exercise 6

<span style="color:green; font-size:16px">Read in the sample dataset (`sample_data.csv`), placing the name column in the index. Create a new Series with two values and use it to create a new column in the DataFrame. Make it such that only one of the values appears in the resulting DataFrame.</span>

In [9]:
df = pd.read_csv('../data/sample_data.csv', index_col='name')
s = pd.Series(index=['Dean', 'Thomas'], data=[99, 2])
df['new_values'] = s
df

Unnamed: 0_level_0,state,color,food,age,height,score,new_values
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Jane,NY,blue,Steak,30,165,4.6,
Niko,TX,green,Lamb,2,70,8.3,
Aaron,FL,red,Mango,12,120,9.0,
Penelope,AL,white,Apple,4,80,3.3,
Dean,AK,gray,Cheese,32,180,1.8,99.0
Christina,TX,black,Melon,33,172,9.5,
Cornelia,TX,red,Beans,69,150,2.2,
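The result follows from index alignment during column assignment: labels in the Series that are absent from the DataFrame index are dropped, and DataFrame rows without a matching label get a missing value. A minimal sketch on a made-up three-row DataFrame (not the sample dataset):

```python
import pandas as pd

# Toy data standing in for the sample dataset
df = pd.DataFrame({'age': [30, 2, 32]}, index=['Jane', 'Niko', 'Dean'])
s = pd.Series({'Dean': 99, 'Thomas': 2})

# Assignment aligns s to df.index: 'Thomas' is dropped, Jane and Niko get NaN
df['new_values'] = s
print(df)
```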


### Exercise 7

<span style="color:green; font-size:16px">Read in the x, y, and z columns from the diamonds dataset. Find the mean of each row and subtract that value from each value in the row.</span>

In [10]:
diamonds = pd.read_csv('../data/diamonds.csv', usecols=['x', 'y', 'z'])
diamonds.head(3)

Unnamed: 0,x,y,z
0,3.95,3.98,2.43
1,3.89,3.84,2.31
2,4.05,4.07,2.31


In [11]:
diamonds.sub(diamonds.mean(axis=1), axis=0).head(3)

Unnamed: 0,x,y,z
0,0.496667,0.526667,-1.023333
1,0.543333,0.493333,-1.036667
2,0.573333,0.593333,-1.166667
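The `sub` method with `axis=0` is used because the minus operator aligns a Series with the DataFrame's columns, not its rows. The sketch below uses a small made-up DataFrame to contrast the two; `centered` and `misaligned` are names chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 4.0], 'y': [2.0, 5.0], 'z': [3.0, 6.0]})
row_means = df.mean(axis=1)          # indexed by row label: 0, 1

# axis=0 aligns the Series index with the DataFrame's row index
centered = df.sub(row_means, axis=0)

# The operator aligns the Series index (0, 1) with the columns (x, y, z);
# nothing matches, so every value is missing
misaligned = df - row_means
print(centered)
print(misaligned)
```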


## 2. Concatenating Data

### Exercise 1

<span style="color:green; font-size:16px">Read in all of the files in the `../data/weather` directory except `city_attributes.csv` as DataFrames placing the `datetime` column in the index. Store the data in a dictionary using the file name (without extension) as the key. Concatenate the DataFrames vertically labeling each DataFrame in the index appropriately.</span>

In [12]:
from pathlib import Path
weather_data_path = Path('../data/weather')
paths = sorted(weather_data_path.glob('*.csv'))
dfs = {}
for path in paths:
    weather_type = path.stem
    if weather_type == 'city_attributes':
        continue
    dfs[weather_type] = pd.read_csv(path, parse_dates=['datetime'], index_col='datetime')
df_weather = pd.concat(dfs, names=['weather_type'])
df_weather.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York
weather_type,datetime,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
humidity,2013-01-01 00:00:00,75.0,58.0,,42.0,46.0,70.0,,55.0,56.0,
humidity,2013-01-01 01:00:00,80.0,62.0,,42.0,50.0,76.0,64.0,64.0,63.0,54.0
humidity,2013-01-01 02:00:00,86.0,70.0,43.0,52.0,58.0,76.0,69.0,,63.0,54.0
humidity,2013-01-01 03:00:00,,70.0,,52.0,53.0,76.0,,86.0,63.0,54.0
humidity,2013-01-01 04:00:00,,70.0,53.0,52.0,53.0,81.0,68.0,80.0,63.0,54.0


In [13]:
df_weather.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York
weather_type,datetime,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
wind_speed,2016-12-31 19:00:00,4.0,3.0,3.0,1.0,2.0,2.0,6.0,1.0,6.0,4.0
wind_speed,2016-12-31 20:00:00,3.0,3.0,1.0,1.0,1.0,3.0,7.0,1.0,6.0,3.0
wind_speed,2016-12-31 21:00:00,3.0,4.0,4.0,1.0,1.0,3.0,7.0,2.0,5.0,7.0
wind_speed,2016-12-31 22:00:00,4.0,3.0,1.0,1.0,1.0,1.0,5.0,3.0,5.0,4.0
wind_speed,2016-12-31 23:00:00,4.0,3.0,1.0,1.0,1.0,1.0,8.0,3.0,5.0,5.0
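Passing a dictionary to `pd.concat` uses the keys as an extra outer index level, which is what labels each block above. A small sketch with made-up values for two metrics (the real files are not read here):

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=3, name='datetime')
toy = {
    'humidity': pd.DataFrame({'Houston': [70.0, 76.0, 76.0],
                              'Denver': [46.0, 50.0, 58.0]}, index=idx),
    'wind_speed': pd.DataFrame({'Houston': [3.0, 3.0, 4.0],
                                'Denver': [7.0, 7.0, 4.0]}, index=idx),
}

# Dictionary keys become the outer index level, named via `names`
tall = pd.concat(toy, names=['weather_type'])

# Selecting an outer label returns that block with its original index
humidity = tall.loc['humidity']
print(tall)
```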


### Exercise 2

<span style="color:green; font-size:16px">Use the dictionary created in Exercise 1 to concatenate the DataFrames horizontally.</span>

In [14]:
df_weather = pd.concat(dfs, names=['weather_type', 'city'], axis=1)
df_weather.head()

weather_type,humidity,humidity,humidity,humidity,humidity,humidity,humidity,humidity,humidity,humidity,...,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed
city,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York,...,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York
datetime,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2013-01-01 00:00:00,75.0,58.0,,42.0,46.0,70.0,,55.0,56.0,,...,2.0,5.0,3.0,1.0,7.0,3.0,4.0,0.0,3.0,13.0
2013-01-01 01:00:00,80.0,62.0,,42.0,50.0,76.0,64.0,64.0,63.0,54.0,...,0.0,3.0,1.0,3.0,7.0,3.0,3.0,0.0,4.0,11.0
2013-01-01 02:00:00,86.0,70.0,43.0,52.0,58.0,76.0,69.0,,63.0,54.0,...,0.0,2.0,2.0,1.0,4.0,4.0,6.0,0.0,4.0,8.0
2013-01-01 03:00:00,,70.0,,52.0,53.0,76.0,,86.0,63.0,54.0,...,0.0,3.0,1.0,1.0,3.0,3.0,7.0,0.0,6.0,9.0
2013-01-01 04:00:00,,70.0,53.0,52.0,53.0,81.0,68.0,80.0,63.0,54.0,...,0.0,3.0,0.0,3.0,4.0,2.0,7.0,0.0,5.0,8.0


### Exercise 3

<span style="color:green; font-size:16px">Write a function that accepts the dictionary created from Exercise 1 and a city name. Return a DataFrame of all of the weather metrics for that city for each day. The index will be the datetime and the columns will be each individual weather metric.</span>

In [15]:
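def get_city_weather(dfs, city):
    """Return one column per weather metric for the given city."""
    s_dict = {}
    for weather_type, df in dfs.items():
        # select the city's Series from each metric's DataFrame
        s_dict[weather_type] = df[city]
    return pd.concat(s_dict, axis=1)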

In [16]:
get_city_weather(dfs, 'Houston').head()

Unnamed: 0_level_0,humidity,pressure,temperature,weather_description,wind_direction,wind_speed
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2013-01-01 00:00:00,70.0,1026.0,8.81,broken clouds,100.0,3.0
2013-01-01 01:00:00,76.0,1026.0,8.81,broken clouds,90.0,3.0
2013-01-01 02:00:00,76.0,1026.0,8.81,overcast clouds,80.0,4.0
2013-01-01 03:00:00,76.0,1026.0,8.48,broken clouds,90.0,3.0
2013-01-01 04:00:00,81.0,1025.0,8.34,sky is clear,90.0,2.0


In [17]:
get_city_weather(dfs, 'Denver').head()

Unnamed: 0_level_0,humidity,pressure,temperature,weather_description,wind_direction,wind_speed
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2013-01-01 00:00:00,46.0,1014.0,-2.3,broken clouds,320.0,7.0
2013-01-01 01:00:00,50.0,1014.0,-3.23,broken clouds,340.0,7.0
2013-01-01 02:00:00,58.0,1015.0,-3.03,broken clouds,350.0,4.0
2013-01-01 03:00:00,53.0,1015.0,-3.67,broken clouds,330.0,3.0
2013-01-01 04:00:00,53.0,1015.0,-5.55,broken clouds,320.0,4.0
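An alternative sketch, not the approach used in the solution, takes a cross-section of the horizontally concatenated DataFrame with `xs`. It uses made-up values for two metrics and assumes every metric has a column for the requested city.

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=3, name='datetime')
toy = {
    'humidity': pd.DataFrame({'Houston': [70.0, 76.0, 76.0],
                              'Denver': [46.0, 50.0, 58.0]}, index=idx),
    'wind_speed': pd.DataFrame({'Houston': [3.0, 3.0, 4.0],
                                'Denver': [7.0, 7.0, 4.0]}, index=idx),
}

# Columns become a two-level MultiIndex: (weather_type, city)
wide = pd.concat(toy, axis=1, names=['weather_type', 'city'])

# Keep only the 'Houston' columns; the city level is dropped from the result
houston = wide.xs('Houston', axis=1, level='city')
print(houston)
```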


### Exercise 4

<span style="color:green; font-size:16px">Iterate through all of the DataFrames in the dictionary created in Exercise 1 and use the `assign` method to add a column for the `weather_type`. Save the results in a **list** named `dfs_list`.</span>

In [18]:
dfs_list = [df.assign(weather_type=wt) for wt, df in dfs.items()]

In [19]:
dfs_list[0].head()

Unnamed: 0_level_0,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York,weather_type
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2013-01-01 00:00:00,75.0,58.0,,42.0,46.0,70.0,,55.0,56.0,,humidity
2013-01-01 01:00:00,80.0,62.0,,42.0,50.0,76.0,64.0,64.0,63.0,54.0,humidity
2013-01-01 02:00:00,86.0,70.0,43.0,52.0,58.0,76.0,69.0,,63.0,54.0,humidity
2013-01-01 03:00:00,,70.0,,52.0,53.0,76.0,,86.0,63.0,54.0,humidity
2013-01-01 04:00:00,,70.0,53.0,52.0,53.0,81.0,68.0,80.0,63.0,54.0,humidity


### Exercise 5

<span style="color:green; font-size:16px">Concatenate all of the DataFrames in `dfs_list` vertically using the `append` method. Create a single DataFrame.</span>

In [20]:
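# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions, pd.concat(dfs_list) gives the same result
df_all = dfs_list[0]
for df in dfs_list[1:]:
    df_all = df_all.append(df)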

In [21]:
df_all.tail()

Unnamed: 0_level_0,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York,weather_type
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2016-12-31 19:00:00,4.0,3.0,3.0,1.0,2.0,2.0,6.0,1.0,6.0,4.0,wind_speed
2016-12-31 20:00:00,3.0,3.0,1.0,1.0,1.0,3.0,7.0,1.0,6.0,3.0,wind_speed
2016-12-31 21:00:00,3.0,4.0,4.0,1.0,1.0,3.0,7.0,2.0,5.0,7.0,wind_speed
2016-12-31 22:00:00,4.0,3.0,1.0,1.0,1.0,1.0,5.0,3.0,5.0,4.0,wind_speed
2016-12-31 23:00:00,4.0,3.0,1.0,1.0,1.0,1.0,8.0,3.0,5.0,5.0,wind_speed
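Because `DataFrame.append` is no longer available in pandas 2.0 and later, the same vertical stacking can be sketched with `pd.concat`, which also avoids copying the growing DataFrame on every loop iteration. The two small DataFrames below are made up for illustration.

```python
import pandas as pd

toy_list = [
    pd.DataFrame({'Houston': [70.0, 76.0]}).assign(weather_type='humidity'),
    pd.DataFrame({'Houston': [3.0, 4.0]}).assign(weather_type='wind_speed'),
]

# Stack all DataFrames in the list vertically in a single call
stacked = pd.concat(toy_list)
print(stacked)
```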


### Exercise 6

<span style="color:green; font-size:16px">Select the temperature data from the dictionary created in Exercise 1. Find the mean, median, min, and max temperature per month for the city of Houston. Then create a new column named `mean_median_diff` that contains the absolute difference between the mean and median. Also create the column `min_max_diff` that rounds the absolute difference between the min and max temperatures. Use one line of code.</span>

In [22]:
# pandas 2.2+ deprecates the 'M' resample alias in favor of 'ME' (month end)
(dfs['temperature']
    .resample('M')['Houston'].agg(['mean', 'median', 'min', 'max'])
    .assign(mean_median_diff=lambda df: (df['mean'] - df['median']).abs(),
            min_max_diff=lambda df: (df['min'] - df['max']).abs().round(0))
    .head())

Unnamed: 0_level_0,mean,median,min,max,mean_median_diff,min_max_diff
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2013-01-31,12.885014,11.93,0.1,25.57,0.955014,25.0
2013-02-28,14.872489,15.06,3.3,25.97,0.187511,23.0
2013-03-31,16.011552,16.625,1.9,31.64,0.613448,30.0
2013-04-30,19.001104,19.725,5.24,28.07,0.723896,23.0
2013-05-31,23.581378,24.145,7.57,32.21,0.563622,25.0
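The chain works because each `lambda` passed to `assign` receives the aggregated DataFrame, so the new columns can refer to `mean`, `median`, `min`, and `max`. The sketch below shows the same pattern on made-up daily values. It groups by calendar month with `to_period('M')` instead of `resample`, a substitution made here so that it does not depend on which resample alias the installed pandas version accepts.

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=60)   # daily, Jan 1 to Mar 1
temps = pd.Series(range(60), index=idx, dtype='float64')

monthly = (
    temps.groupby(temps.index.to_period('M'))
         .agg(['mean', 'median', 'min', 'max'])
         # each lambda receives the aggregated DataFrame built so far
         .assign(mean_median_diff=lambda d: (d['mean'] - d['median']).abs(),
                 min_max_diff=lambda d: (d['min'] - d['max']).abs().round(0))
)
print(monthly)
```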


## 3. Joining DataFrames

In [23]:
import pandas as pd
CS = 'sqlite:///../data/databases/chinook.db'
tracks = pd.read_sql('tracks', CS)
genres = pd.read_sql('genres', CS)
albums = pd.read_sql('albums', CS)
artists = pd.read_sql('artists', CS)
media_types = pd.read_sql('media_types', CS)
playlist_track = pd.read_sql('playlist_track', CS)
playlists = pd.read_sql('playlists', CS)
invoice_items = pd.read_sql('invoice_items', CS)
invoices = pd.read_sql('invoices', CS)
customers = pd.read_sql('customers', CS)
employees = pd.read_sql('employees', CS)

### Exercise 1

<span style="color:green; font-size:16px">Find the occurrences of each media type in the tracks table. Use the name of the media type.</span>

In [24]:
df = tracks.merge(media_types, how='inner',
                  on='MediaTypeId', suffixes=('_track', '_mt'))
df.head(3)

Unnamed: 0,TrackId,Name_track,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice,Name_mt
0,1,For Those About To Rock (We Salute You),1,1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99,MPEG audio file
1,6,Put The Finger On You,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",205662,6713451,0.99,MPEG audio file
2,7,Let's Get It Up,1,1,1,"Angus Young, Malcolm Young, Brian Johnson",233926,7636561,0.99,MPEG audio file


In [25]:
df['Name_mt'].value_counts()

MPEG audio file                3034
Protected AAC audio file        237
Protected MPEG-4 video file     214
AAC audio file                   11
Purchased AAC audio file          7
Name: Name_mt, dtype: int64

Alternatively, count the occurrences of each `MediaTypeId` first.

In [26]:
mt_count = tracks['MediaTypeId'].value_counts()
mt_count

1    3034
2     237
3     214
5      11
4       7
Name: MediaTypeId, dtype: int64

A DataFrame column can be joined to the index of a Series by setting `right_index=True`.

In [27]:
media_types.merge(mt_count, left_on='MediaTypeId', right_index=True)

Unnamed: 0,MediaTypeId,MediaTypeId_x,Name,MediaTypeId_y
0,1,1,MPEG audio file,3034
1,2,2,Protected AAC audio file,237
2,3,3,Protected MPEG-4 video file,214
3,4,4,Purchased AAC audio file,7
4,5,5,AAC audio file,11
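The duplicated `MediaTypeId` columns above make the result hard to read. One possible tidy-up, sketched on made-up data, is to rename the count Series before merging; `n_tracks` is a name chosen here for illustration.

```python
import pandas as pd

toy_media_types = pd.DataFrame({
    'MediaTypeId': [1, 2],
    'Name': ['MPEG audio file', 'Protected AAC audio file'],
})
toy_tracks = pd.DataFrame({'MediaTypeId': [1, 1, 2]})

# Give the counts a descriptive name so the merged column is unambiguous
mt_count = toy_tracks['MediaTypeId'].value_counts().rename('n_tracks')

# Match the MediaTypeId column against the index of the count Series
result = toy_media_types.merge(mt_count, left_on='MediaTypeId',
                               right_index=True)
print(result)
```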


### Exercise 2

<span style="color:green; font-size:16px">Are there any playlists that have no tracks? If so, which ones are they? Use `merge` in your solution.</span>

In [28]:
playlists.head()

Unnamed: 0,PlaylistId,Name
0,1,Music
1,2,Movies
2,3,TV Shows
3,4,Audiobooks
4,5,90’s Music


In [29]:
playlist_track.head()

Unnamed: 0,PlaylistId,TrackId
0,1,3402
1,1,3389
2,1,3390
3,1,3391
4,1,3392


In [30]:
df = playlists.merge(playlist_track, how='left', on='PlaylistId', indicator=True)
df.head()

Unnamed: 0,PlaylistId,Name,TrackId,_merge
0,1,Music,3402.0,both
1,1,Music,3389.0,both
2,1,Music,3390.0,both
3,1,Music,3391.0,both
4,1,Music,3392.0,both


Check whether any rows are marked `left_only`.

In [31]:
df['_merge'].value_counts()

both          8715
left_only        4
right_only       0
Name: _merge, dtype: int64

Yes, there are four playlists without any tracks. Let's filter for them.

In [32]:
df.query('_merge == "left_only"')

Unnamed: 0,PlaylistId,Name,TrackId,_merge
3290,2,Movies,,left_only
3504,4,Audiobooks,,left_only
4982,6,Audiobooks,,left_only
4983,7,Movies,,left_only
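The left-join-plus-indicator pattern used here is sometimes called an anti-join: keep the rows of the left table that found no partner in the right table. A minimal sketch on made-up playlists:

```python
import pandas as pd

toy_playlists = pd.DataFrame({'PlaylistId': [1, 2, 3],
                              'Name': ['Music', 'Movies', 'TV Shows']})
toy_playlist_track = pd.DataFrame({'PlaylistId': [1, 1, 3],
                                   'TrackId': [10, 11, 12]})

merged = toy_playlists.merge(toy_playlist_track, how='left',
                             on='PlaylistId', indicator=True)

# Rows marked 'left_only' had no match in the right table
no_tracks = merged.loc[merged['_merge'] == 'left_only', ['PlaylistId', 'Name']]
print(no_tracks)
```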


### Exercise 3

<span style="color:green; font-size:16px">Find the number of tracks per playlist. Use the playlist name in the result. Some playlists have the same name. Make sure not to combine them.</span>

In [33]:
playlists.head()

Unnamed: 0,PlaylistId,Name
0,1,Music
1,2,Movies
2,3,TV Shows
3,4,Audiobooks
4,5,90’s Music


In [34]:
# there should be 18 playlists in the result
playlists.shape

(18, 2)

In [35]:
playlist_track.head()

Unnamed: 0,PlaylistId,TrackId
0,1,3402
1,1,3389
2,1,3390
3,1,3391
4,1,3392


In [36]:
df = playlists.merge(playlist_track, how='left', on='PlaylistId')
df.head()

Unnamed: 0,PlaylistId,Name,TrackId
0,1,Music,3402.0
1,1,Music,3389.0
2,1,Music,3390.0
3,1,Music,3391.0
4,1,Music,3392.0


Some playlists have no tracks, so a left join is needed to keep them in the result. Use `count`, which counts only non-missing values, so that those playlists get 0.

In [37]:
df.groupby(['PlaylistId', 'Name'])['TrackId'].count().sort_values(ascending=False)

PlaylistId  Name                      
1           Music                         3290
8           Music                         3290
5           90’s Music                    1477
10          TV Shows                       213
3           TV Shows                       213
12          Classical                       75
11          Brazilian Music                 39
17          Heavy Metal Classic             26
13          Classical 101 - Deep Cuts       25
14          Classical 101 - Next Steps      25
15          Classical 101 - The Basics      25
16          Grunge                          15
9           Music Videos                     1
18          On-The-Go 1                      1
4           Audiobooks                       0
6           Audiobooks                       0
7           Movies                           0
2           Movies                           0
Name: TrackId, dtype: int64
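The difference between `count` and `size` after a left join can be seen on a made-up example: an unmatched playlist still occupies one row (with a missing `TrackId`), so `size` reports 1 where `count` reports 0.

```python
import pandas as pd

toy_playlists = pd.DataFrame({'PlaylistId': [1, 2],
                              'Name': ['Music', 'Movies']})
toy_playlist_track = pd.DataFrame({'PlaylistId': [1, 1],
                                   'TrackId': [10, 11]})

df_toy = toy_playlists.merge(toy_playlist_track, how='left', on='PlaylistId')
grouped = df_toy.groupby(['PlaylistId', 'Name'])['TrackId']

counts = grouped.count()   # non-missing TrackId values per playlist
sizes = grouped.size()     # rows per playlist, missing values included
print(counts)
print(sizes)
```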

### Exercise 4

<span style="color:green; font-size:16px">Find the number of invoices per customer. Show the customer id, first name, and last name and count of invoices.</span>

In [38]:
(invoices.merge(customers, how='inner', on='CustomerId')
         .groupby(['CustomerId', 'FirstName', 'LastName'])).size().head()

CustomerId  FirstName  LastName   
1           Luís       Gonçalves      7
2           Leonie     Köhler         7
3           François   Tremblay       7
4           Bjørn      Hansen         7
5           František  Wichterlová    7
dtype: int64

In [39]:
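# joining invoice_items as well makes size() count invoice line items per customer (38), not invoices (7)
(customers.merge(invoices, how='inner', on='CustomerId')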
          .merge(invoice_items, how='inner', on='InvoiceId')
          .groupby(['CustomerId', 'FirstName', 'LastName']).size().head()
)

CustomerId  FirstName  LastName   
1           Luís       Gonçalves      38
2           Leonie     Köhler         38
3           François   Tremblay       38
4           Bjørn      Hansen         38
5           František  Wichterlová    38
dtype: int64
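The two results differ because every additional one-to-many join multiplies the rows. A sketch on made-up tables shows the effect, and how `nunique` on `InvoiceId` would still recover the number of invoices after the second join:

```python
import pandas as pd

toy_customers = pd.DataFrame({'CustomerId': [1], 'FirstName': ['Luís']})
toy_invoices = pd.DataFrame({'InvoiceId': [100, 101], 'CustomerId': [1, 1]})
toy_items = pd.DataFrame({'InvoiceId': [100, 100, 101],
                          'TrackId': [5, 6, 7]})

joined = (toy_customers.merge(toy_invoices, on='CustomerId')
                       .merge(toy_items, on='InvoiceId'))
by_customer = joined.groupby('CustomerId')

n_items = by_customer.size()                      # one row per invoice item
n_invoices = by_customer['InvoiceId'].nunique()   # distinct invoices
print(n_items, n_invoices)
```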

### Exercise 5

<span style="color:green; font-size:16px">How many customers is each employee responsible for?</span>

In [40]:
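# value_counts counts rows, so employees with no customers still show 1
# (their single unmatched row from the left join), not 0
(employees.merge(customers, how='left',
                left_on='EmployeeId', right_on='SupportRepId')
          .value_counts('EmployeeId'))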

EmployeeId
3    21
4    20
5    18
1     1
2     1
6     1
7     1
8     1
dtype: int64

### Exercise 6

<span style="color:green; font-size:16px">Find all of the tracks with the same name as the album title.</span>

In [41]:
(tracks.merge(albums, how='inner', on='AlbumId')
       .query('Name == Title'))[['TrackId', 'Name', 'Title']].head(10)

Unnamed: 0,TrackId,Name,Title
10,2,Balls to the Wall,Balls to the Wall
12,4,Restless and Wild,Restless and Wild
16,17,Let There Be Rock,Let There Be Rock
99,100,Out Of Exile,Out Of Exile
148,149,Black Sabbath,Black Sabbath
168,169,Body Count,Body Count
183,184,Chemical Wedding,Chemical Wedding
205,206,Prenda Minha,Prenda Minha
236,237,Minha Historia,Minha Historia
288,275,Da Lama Ao Caos,Da Lama Ao Caos


### Exercise 7

<span style="color:green; font-size:16px">Find the top 10 tracks by length of song (Milliseconds).</span>

In [42]:
# no join needed: Milliseconds is already a column in tracks
tracks.nlargest(10, 'Milliseconds')

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
2819,2820,Occupation / Precipice,227,3,19,,5286953,1054423946,1.99
3223,3224,Through a Looking Glass,229,3,21,,5088838,1059546140,1.99
3243,3244,"Greetings from Earth, Pt. 1",253,3,20,,2960293,536824558,1.99
3241,3242,The Man With Nine Lives,253,3,20,,2956998,577829804,1.99
3226,3227,"Battlestar Galactica, Pt. 2",253,3,20,,2956081,521387924,1.99
3225,3226,"Battlestar Galactica, Pt. 1",253,3,20,,2952702,541359437,1.99
3242,3243,Murder On the Rising Star,253,3,20,,2935894,551759986,1.99
3227,3228,"Battlestar Galactica, Pt. 3",253,3,20,,2927802,554509033,1.99
3247,3248,Take the Celestra,253,3,20,,2927677,512381289,1.99
3238,3239,Fire In Space,253,3,20,,2926593,536784757,1.99


### Exercise 8

<span style="color:green; font-size:16px">Are there any genres that do not appear in the tracks table? If so, which ones are they? Use `merge` in your solution.</span>

In [43]:
df = genres.merge(tracks, how='left', on='GenreId', indicator=True)
df.head()

Unnamed: 0,GenreId,Name_x,TrackId,Name_y,AlbumId,MediaTypeId,Composer,Milliseconds,Bytes,UnitPrice,_merge
0,1,Rock,1,For Those About To Rock (We Salute You),1,1,"Angus Young, Malcolm Young, Brian Johnson",343719,11170334,0.99,both
1,1,Rock,2,Balls to the Wall,2,2,,342562,5510424,0.99,both
2,1,Rock,3,Fast As a Shark,3,2,"F. Baltes, S. Kaufman, U. Dirkscneider & W. Ho...",230619,3990994,0.99,both
3,1,Rock,4,Restless and Wild,3,2,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...",252051,4331779,0.99,both
4,1,Rock,5,Princess of the Dawn,3,2,Deaffy & R.A. Smith-Diesel,375418,6290521,0.99,both


Count the values in the `_merge` column. There are no `left_only` rows, so every genre appears in the tracks table.

In [44]:
df['_merge'].value_counts()

both          3503
left_only        0
right_only       0
Name: _merge, dtype: int64

### Exercise 9

<span style="color:green; font-size:16px">Count the number of albums per artist. Make sure to include artists that do not have any albums.</span>

In [45]:
df = artists.merge(albums, on='ArtistId', how='left')
df.head()

Unnamed: 0,ArtistId,Name,AlbumId,Title
0,1,AC/DC,1.0,For Those About To Rock We Salute You
1,1,AC/DC,4.0,Let There Be Rock
2,2,Accept,2.0,Balls to the Wall
3,2,Accept,3.0,Restless and Wild
4,3,Aerosmith,5.0,Big Ones


In [46]:
# note that some artists have 0 albums
df.groupby('Name')['AlbumId'].count().head(10)

Name
A Cor Do Som                                                                             0
AC/DC                                                                                    2
Aaron Copland & London Symphony Orchestra                                                1
Aaron Goldberg                                                                           1
Academy of St. Martin in the Fields & Sir Neville Marriner                               1
Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner              1
Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair    1
Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart                1
Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett              0
Accept                                                                                   2
Name: AlbumId, dtype: int64

### Exercise 10

<span style="color:green; font-size:16px">Find the cost of each playlist. Include playlists with zero tracks.</span>

In [47]:
t = tracks[['TrackId', 'UnitPrice']]
df = (playlists.merge(playlist_track, how='left', on='PlaylistId')
               .merge(t, how='left', on='TrackId'))
df.head()

Unnamed: 0,PlaylistId,Name,TrackId,UnitPrice
0,1,Music,3402.0,0.99
1,1,Music,3389.0,0.99
2,1,Music,3390.0,0.99
3,1,Music,3391.0,0.99
4,1,Music,3392.0,0.99


In [48]:
df.groupby(['PlaylistId', 'Name'])['UnitPrice'].agg(['count', 'sum'])

Unnamed: 0_level_0,Unnamed: 1_level_0,count,sum
PlaylistId,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Music,3290,3257.1
2,Movies,0,0.0
3,TV Shows,213,423.87
4,Audiobooks,0,0.0
5,90’s Music,1477,1462.23
6,Audiobooks,0,0.0
7,Movies,0,0.0
8,Music,3290,3257.1
9,Music Videos,1,0.99
10,TV Shows,213,423.87


### Exercise 11

<span style="color:green; font-size:16px">Count the total number of times each track was sold and return the top 10 tracks.</span>

In [49]:
# note: grouping by Name alone combines different tracks that share a name
(invoice_items.merge(tracks, how='inner', on='TrackId')
              .groupby('Name')['Quantity'].sum()
              .nlargest(10))

Name
The Trooper                5
Eruption                   4
Hallowed Be Thy Name       4
Sure Know Something        4
The Number Of The Beast    4
Untitled                   4
2 Minutes To Midnight      3
Blood Brothers             3
Brasil                     3
Can I Play With Madness    3
Name: Quantity, dtype: int64

### Exercise 12

<span style="color:green; font-size:16px">Create a pivot table with billing country and genre as the index and columns and the number of tracks sold as the values.</span>

In [50]:
# create simpler tables first with minimum number of columns
i = invoices[['InvoiceId', 'BillingCountry']]
ii = invoice_items[['InvoiceId', 'TrackId']]
t = tracks[['TrackId', 'GenreId']]
df = (i.merge(ii, how='inner', on='InvoiceId')
       .merge(t, how='inner', on='TrackId')
       .merge(genres, how='inner', on='GenreId'))
df.head()

Unnamed: 0,InvoiceId,BillingCountry,TrackId,GenreId,Name
0,1,Germany,2,1,Rock
1,214,Canada,2,1,Rock
2,1,Germany,4,1,Rock
3,2,Norway,6,1,Rock
4,2,Norway,8,1,Rock


In [51]:
pd.crosstab(index=df['BillingCountry'], columns=df['Name']).head()

Name,Alternative,Alternative & Punk,Blues,Bossa Nova,Classical,Comedy,Drama,Easy Listening,Electronica/Dance,Heavy Metal,...,Pop,R&B/Soul,Reggae,Rock,Rock And Roll,Sci Fi & Fantasy,Science Fiction,Soundtrack,TV Shows,World
BillingCountry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Argentina,0,9,0,0,0,0,0,2,0,0,...,0,0,0,9,0,0,0,1,0,0
Australia,0,0,1,0,0,0,0,0,0,3,...,0,0,2,22,0,0,0,0,0,0
Austria,0,0,0,0,2,0,1,0,0,0,...,1,4,0,15,0,0,0,0,4,0
Belgium,0,14,0,0,0,0,0,0,0,0,...,0,2,0,21,0,0,0,0,0,0
Brazil,0,7,6,0,6,0,0,0,0,0,...,3,3,6,81,0,2,0,4,0,2


### Exercise 13

<span style="color:green; font-size:16px">Find the name and email of each employee's boss. Use the `suffixes` argument to better label the merged data. Be sure to include employees that don't have bosses. This is called a recursive relationship.</span>

In [52]:
e = employees[['EmployeeId', 'FirstName', 'LastName', 'ReportsTo']]
boss = employees[['EmployeeId', 'FirstName', 'LastName', 'Email']]
e.merge(boss, how='left', left_on='ReportsTo', 
        right_on='EmployeeId', suffixes=('_employee', '_boss'))

Unnamed: 0,EmployeeId_employee,FirstName_employee,LastName_employee,ReportsTo,EmployeeId_boss,FirstName_boss,LastName_boss,Email
0,1,Andrew,Adams,,,,,
1,2,Nancy,Edwards,1.0,1.0,Andrew,Adams,andrew@chinookcorp.com
2,3,Jane,Peacock,2.0,2.0,Nancy,Edwards,nancy@chinookcorp.com
3,4,Margaret,Park,2.0,2.0,Nancy,Edwards,nancy@chinookcorp.com
4,5,Steve,Johnson,2.0,2.0,Nancy,Edwards,nancy@chinookcorp.com
5,6,Michael,Mitchell,1.0,1.0,Andrew,Adams,andrew@chinookcorp.com
6,7,Robert,King,6.0,6.0,Michael,Mitchell,michael@chinookcorp.com
7,8,Laura,Callahan,6.0,6.0,Michael,Mitchell,michael@chinookcorp.com


### Exercise 14

<span style="color:green; font-size:16px">Find the average length of tracks for each artist for those with at least 10 tracks. Return five artists with the longest average track length.</span>

In [53]:
a = artists.rename(columns={'Name': 'ArtistName'})
t = tracks[['AlbumId', 'Milliseconds']]
df = (a.merge(albums, how='inner', on='ArtistId')
       .merge(t, how='inner', on='AlbumId'))
df.head()

Unnamed: 0,ArtistId,ArtistName,AlbumId,Title,Milliseconds
0,1,AC/DC,1,For Those About To Rock We Salute You,343719
1,1,AC/DC,1,For Those About To Rock We Salute You,205662
2,1,AC/DC,1,For Those About To Rock We Salute You,233926
3,1,AC/DC,1,For Those About To Rock We Salute You,210834
4,1,AC/DC,1,For Those About To Rock We Salute You,203102


In [54]:
df2 = (df.groupby('ArtistName')['Milliseconds'].agg(['count', 'mean'])
         .query('count >= 10')
         .sort_values('mean', ascending=False))
# convert the mean track length from milliseconds to minutes
df2['mean'] = df2['mean'] / (60 * 1000)
df2.head()

Unnamed: 0_level_0,count,mean
ArtistName,Unnamed: 1_level_1,Unnamed: 2_level_1
Battlestar Galactica (Classic),24,48.759572
Battlestar Galactica,20,46.174409
Heroes,23,43.319035
Lost,92,43.16641
The Office,53,23.56241
