# SI 330 - Lab 10: Taking a Step Back

## Submission Instructions:
Please turn in this Jupyter notebook file (both .ipynb and .html formats) on Canvas.

### Name:  Samantha Cohen
### Uniqname: samcoh
### People you worked with: Emil, Rhea, and Will 

## Objectives:

* Practice translating between data manipulation frameworks

In each question in this lab, we will give you code in one framework, and you will write it out in another framework.

## Part 0: Setup

In [1]:
import sqlite3 as sqlite
import numpy as np
import pandas as pd
from collections import defaultdict

### Part 0.1: SQLite Setup

Way back in HW4, we used an SQLite database called `chinook.db`. That database contains the following tables:

- `artists`: Musical artists, each with an artist id and name.
- `albums`: Musical albums. Each album belongs to one artist. However, one artist may have multiple albums.
- `tracks`: Songs. Each track belongs to one album.
- `invoices` and `invoice_items`: Purchase invoices. Each invoice represents a single purchase, associated with one or more invoice items. Each invoice item represents the purchase of a single track, the dollar amount paid, and the quantity.
- `employees`: Employees, each with an employee id, last name, first name, etc. It also has a field named ReportsTo to specify who reports to whom.
- `customers`: Customers.
- `media_types`: Media types, such as MPEG audio file, AAC audio file, etc.
- `genres`: Music types such as rock, jazz, metal, etc.
- `playlists` and `playlist_track`: Musical playlists. Each playlist contains a list of tracks. Each track may belong to multiple playlists. The relationship between the playlists table and tracks table is many-to-many. The playlist_track table is used to reflect this relationship.
 

A schematic view of this database is below:
    
![The chinook database](chinook.png)
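
The many-to-many link between playlists and tracks can be sketched with a tiny in-memory database. The rows below are invented stand-ins, not the real `chinook.db` contents, and the connection name `demo` is only for illustration; the table and column names are assumed to follow the schema above.

```python
import sqlite3

# Toy stand-in for chinook.db: the rows are invented for illustration.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE playlists (PlaylistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE playlist_track (PlaylistId INTEGER, TrackId INTEGER);
    INSERT INTO playlists VALUES (1, 'Grunge'), (2, 'Road Trip');
    INSERT INTO tracks VALUES (10, 'Lithium'), (11, 'Polly');
    INSERT INTO playlist_track VALUES (1, 10), (1, 11), (2, 10);
""")

# Resolve the many-to-many link by joining through the junction table.
pairs = demo.execute("""
    SELECT playlists.Name, tracks.Name
    FROM playlist_track
    JOIN playlists ON playlists.PlaylistId = playlist_track.PlaylistId
    JOIN tracks ON tracks.TrackId = playlist_track.TrackId
    ORDER BY playlists.PlaylistId, tracks.TrackId
""").fetchall()

print(pairs)
```

Here 'Lithium' appears in both playlists and 'Grunge' holds two tracks, which is exactly what a junction table is for.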

Now let's connect to the database:

In [2]:
con = sqlite.connect('chinook.db')
con.row_factory = sqlite.Row
cur = con.cursor()

Now that we're connected, we can print out the contents of a table:

In [3]:
cur.execute("SELECT * FROM media_types")
[dict(row) for row in cur.fetchall()]

[{'MediaTypeId': 1, 'Name': 'MPEG audio file'},
 {'MediaTypeId': 2, 'Name': 'Protected AAC audio file'},
 {'MediaTypeId': 3, 'Name': 'Protected MPEG-4 video file'},
 {'MediaTypeId': 4, 'Name': 'Purchased AAC audio file'},
 {'MediaTypeId': 5, 'Name': 'AAC audio file'}]
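
The `dict(row)` conversion relies on the row factory set when we connected. A minimal sketch of the difference, using a throwaway in-memory table (the one-row table and the name `demo` are assumptions for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE media_types (MediaTypeId INTEGER, Name TEXT)")
demo.execute("INSERT INTO media_types VALUES (1, 'MPEG audio file')")

# Default behaviour: each row comes back as a plain tuple.
plain_row = demo.execute("SELECT * FROM media_types").fetchone()

# With sqlite3.Row, columns are reachable by name and dict(row) works.
demo.row_factory = sqlite3.Row
named_row = demo.execute("SELECT * FROM media_types").fetchone()

print(plain_row)
print(named_row["Name"])
print(dict(named_row))
```

Without the row factory, `dict(row)` would fail, because a tuple carries no column names.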

### Part 0.2: List-of-dictionaries setup

Let's translate all the tables from the database into lists of dictionaries so that we can write code in list-of-dictionaries style.

For every table in the database, we'll create a variable with the suffix `_lod` in list-of-dictionaries format representing that table. For example, for `media_types`:

In [4]:
cur.execute("SELECT * FROM media_types")
media_types_lod = [dict(row) for row in cur.fetchall()]

media_types_lod

[{'MediaTypeId': 1, 'Name': 'MPEG audio file'},
 {'MediaTypeId': 2, 'Name': 'Protected AAC audio file'},
 {'MediaTypeId': 3, 'Name': 'Protected MPEG-4 video file'},
 {'MediaTypeId': 4, 'Name': 'Purchased AAC audio file'},
 {'MediaTypeId': 5, 'Name': 'AAC audio file'}]

Now the rest of them:

In [5]:
cur.execute("SELECT * FROM artists")
artists_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM albums")
albums_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM tracks")
tracks_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM invoices")
invoices_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM employees")
employees_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM customers")
customers_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM genres")
genres_lod = [dict(row) for row in cur.fetchall()]

cur.execute("SELECT * FROM playlists")
playlists_lod = [dict(row) for row in cur.fetchall()]
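
The repeated pattern above could also be written as a loop. The sketch below is one possible approach, not part of the lab: `load_tables_as_lod` is a hypothetical helper, and it is shown against a made-up in-memory database rather than `chinook.db`.

```python
import sqlite3

def load_tables_as_lod(connection):
    """Return {table_name: list of row dicts} for every user table."""
    # Note: this changes the row factory on the connection passed in.
    connection.row_factory = sqlite3.Row
    names = [
        row["name"]
        for row in connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    # Table names cannot be bound with ?, so they are read from
    # sqlite_master and quoted as identifiers.
    return {
        name: [dict(row) for row in connection.execute(f'SELECT * FROM "{name}"')]
        for name in names
    }

# Made-up demo data.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE genres (GenreId INTEGER, Name TEXT);
    INSERT INTO genres VALUES (1, 'Rock'), (2, 'Jazz');
    CREATE TABLE playlists (PlaylistId INTEGER, Name TEXT);
    INSERT INTO playlists VALUES (1, 'Music');
""")
tables = load_tables_as_lod(demo)
print(tables["genres"])
```

A dictionary keyed by table name trades the convenience of separate `_lod` variables for less repetition.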

### Part 0.3: Pandas setup

Let's translate all the tables from lists of dictionaries into pandas dataframes so that we can write code in pandas style.

For every table, we'll create a variable with the suffix `_df` in pandas data frame format representing that table. For example, for `media_types`:

In [6]:
media_types_df = pd.DataFrame(media_types_lod)

media_types_df

   MediaTypeId                         Name
0            1              MPEG audio file
1            2     Protected AAC audio file
2            3  Protected MPEG-4 video file
3            4     Purchased AAC audio file
4            5               AAC audio file


Now the rest of them:

In [7]:
artists_df = pd.DataFrame(artists_lod)
albums_df = pd.DataFrame(albums_lod)
tracks_df = pd.DataFrame(tracks_lod)
invoices_df = pd.DataFrame(invoices_lod)
employees_df = pd.DataFrame(employees_lod)
customers_df = pd.DataFrame(customers_lod)
genres_df = pd.DataFrame(genres_lod)
playlists_df = pd.DataFrame(playlists_lod)
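
Going through lists of dictionaries is not the only route between the frameworks. As a rough sketch on made-up data (the `demo` connection and the `genres_demo_*` names are assumptions for illustration), pandas can read a query result directly, and a data frame can be turned back into a list of dictionaries:

```python
import sqlite3
import pandas as pd

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE genres (GenreId INTEGER, Name TEXT);
    INSERT INTO genres VALUES (1, 'Rock'), (2, 'Jazz');
""")

# SQL -> pandas in one step, skipping the list-of-dictionaries stage.
genres_demo_df = pd.read_sql_query("SELECT * FROM genres", demo)

# pandas -> list of dictionaries, one dict per row.
genres_demo_lod = genres_demo_df.to_dict("records")

print(genres_demo_lod)
```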

## Part 1: Examples of expected responses

In this lab, we'll give you some code in one framework and name one or more other frameworks to translate it into so that equivalent output is returned. Something like this:

Given this code:

In [8]:
cur.execute("SELECT FirstName FROM employees")
[dict(row) for row in cur.fetchall()]

[{'FirstName': 'Andrew'},
 {'FirstName': 'Nancy'},
 {'FirstName': 'Jane'},
 {'FirstName': 'Margaret'},
 {'FirstName': 'Steve'},
 {'FirstName': 'Michael'},
 {'FirstName': 'Robert'},
 {'FirstName': 'Laura'}]

If we ask you to translate it into **pandas**, you would write something like this:

In [9]:
employees_df["FirstName"]

0      Andrew
1       Nancy
2        Jane
3    Margaret
4       Steve
5     Michael
6      Robert
7       Laura
Name: FirstName, dtype: object

Or if we had asked to translate that code into **list-of-dictionaries**, you could write something like:

In [10]:
[row["FirstName"] for row in employees_lod]

['Andrew', 'Nancy', 'Jane', 'Margaret', 'Steve', 'Michael', 'Robert', 'Laura']

## Part 2: Let's get translating!

### Q1

Given this code:

In [11]:
cur.execute("SELECT count(*) FROM artists")
[dict(row) for row in cur.fetchall()]

[{'count(*)': 275}]

Translate it into **pandas**:

In [12]:
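# len() gives the number of rows, matching count(*).
# (artists_df.count() would instead return the non-null count of each column.)
len(artists_df)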

275

Translate it into **lod**:

In [13]:
len(artists_lod)

275
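
The three translations agree here because `artists` has no missing rows to worry about, but the counting functions differ once NULLs appear. A small sketch on invented data: `COUNT(*)` and `len()` count rows, while `COUNT(column)` (like `DataFrame.count()` per column) skips missing values.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE tracks (Name TEXT, Composer TEXT);
    INSERT INTO tracks VALUES ('Song A', 'Ann'), ('Song B', NULL), ('Song C', 'Bo');
""")

# COUNT(*) counts rows; COUNT(column) skips rows where the column is NULL.
all_rows, with_composer = demo.execute(
    "SELECT COUNT(*), COUNT(Composer) FROM tracks"
).fetchone()

# The list-of-dictionaries equivalents of the two counts.
demo_lod = [
    {"Name": "Song A", "Composer": "Ann"},
    {"Name": "Song B", "Composer": None},
    {"Name": "Song C", "Composer": "Bo"},
]
lod_rows = len(demo_lod)
lod_with_composer = sum(1 for r in demo_lod if r["Composer"] is not None)

print(all_rows, with_composer, lod_rows, lod_with_composer)
```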

### Q2

Given this code:

In [14]:
[r["Name"] for r in tracks_lod if r["Composer"] == "Kurt Cobain"]

['Intro',
 'School',
 'Drain You',
 'Been A Son',
 'Lithium',
 'Sliver',
 'Spank Thru',
 'Heart-Shaped Box',
 'Milk It',
 'Negative Creep',
 'Polly',
 'Breed',
 "Tourette's",
 'Blew',
 'Smells Like Teen Spirit',
 'In Bloom',
 'Come As You Are',
 'Breed',
 'Lithium',
 'Polly',
 'Territorial Pissings',
 'Drain You',
 'Lounge Act',
 'Stay Away',
 'On A Plain',
 'Something In The Way']

Translate it into **pandas**:

In [15]:
tracks_df[tracks_df["Composer"] == "Kurt Cobain"]["Name"].values


array(['Intro', 'School', 'Drain You', 'Been A Son', 'Lithium', 'Sliver',
       'Spank Thru', 'Heart-Shaped Box', 'Milk It', 'Negative Creep',
       'Polly', 'Breed', "Tourette's", 'Blew', 'Smells Like Teen Spirit',
       'In Bloom', 'Come As You Are', 'Breed', 'Lithium', 'Polly',
       'Territorial Pissings', 'Drain You', 'Lounge Act', 'Stay Away',
       'On A Plain', 'Something In The Way'], dtype=object)

Translate it into **SQL**:

In [16]:
statement = '''
    SELECT Name 
    FROM Tracks 
    WHERE Composer = 'Kurt Cobain'
'''

cur.execute(statement)
[dict(row) for row in cur.fetchall()]

[{'Name': 'Intro'},
 {'Name': 'School'},
 {'Name': 'Drain You'},
 {'Name': 'Been A Son'},
 {'Name': 'Lithium'},
 {'Name': 'Sliver'},
 {'Name': 'Spank Thru'},
 {'Name': 'Heart-Shaped Box'},
 {'Name': 'Milk It'},
 {'Name': 'Negative Creep'},
 {'Name': 'Polly'},
 {'Name': 'Breed'},
 {'Name': "Tourette's"},
 {'Name': 'Blew'},
 {'Name': 'Smells Like Teen Spirit'},
 {'Name': 'In Bloom'},
 {'Name': 'Come As You Are'},
 {'Name': 'Breed'},
 {'Name': 'Lithium'},
 {'Name': 'Polly'},
 {'Name': 'Territorial Pissings'},
 {'Name': 'Drain You'},
 {'Name': 'Lounge Act'},
 {'Name': 'Stay Away'},
 {'Name': 'On A Plain'},
 {'Name': 'Something In The Way'}]
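
When the value being matched comes from a variable rather than a literal, a bound `?` parameter avoids quoting problems. A hedged sketch on made-up rows (`track_names_by` is a hypothetical helper, not part of the lab):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE tracks (Name TEXT, Composer TEXT);
    INSERT INTO tracks VALUES
        ('Lithium', 'Kurt Cobain'), ('Polly', 'Kurt Cobain'), ('One', 'U2');
""")

def track_names_by(connection, composer):
    """Names of tracks by one composer, using a bound ? parameter."""
    rows = connection.execute(
        "SELECT Name FROM tracks WHERE Composer = ?", (composer,)
    )
    return [name for (name,) in rows]

print(track_names_by(demo, "Kurt Cobain"))
```

Because the value is bound rather than pasted into the SQL text, a composer name containing an apostrophe would not break the statement.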

### Q3

Given this code:

In [17]:
tracks_df["Composer"].value_counts().sort_values(ascending = False).head(10)

Steve Harris                                      80
U2                                                44
Jagger/Richards                                   35
Billy Corgan                                      31
Kurt Cobain                                       26
Bill Berry-Peter Buck-Mike Mills-Michael Stipe    25
The Tea Party                                     24
Chico Science                                     23
Chris Cornell                                     23
Miles Davis                                       23
Name: Composer, dtype: int64

Translate it into **SQL** (Note: the SQL solution will count NULL values, while the pandas count does not --- that is fine):

In [18]:
statement = '''
    SELECT Composer, count(*)
    FROM Tracks 
    WHERE Composer IS NOT NULL
    GROUP BY Composer
    ORDER BY Count(*) DESC 
    LIMIT 10
'''

cur.execute(statement)
[dict(row) for row in cur.fetchall()]

[{'Composer': 'Steve Harris', 'count(*)': 80},
 {'Composer': 'U2', 'count(*)': 44},
 {'Composer': 'Jagger/Richards', 'count(*)': 35},
 {'Composer': 'Billy Corgan', 'count(*)': 31},
 {'Composer': 'Kurt Cobain', 'count(*)': 26},
 {'Composer': 'Bill Berry-Peter Buck-Mike Mills-Michael Stipe',
  'count(*)': 25},
 {'Composer': 'The Tea Party', 'count(*)': 24},
 {'Composer': 'Chico Science', 'count(*)': 23},
 {'Composer': 'Chris Cornell', 'count(*)': 23},
 {'Composer': 'Gilberto Gil', 'count(*)': 23}]
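
The note about NULL values can be seen on a toy table (invented rows, with `demo` as an illustrative connection name). Left unfiltered, `GROUP BY` collects every NULL composer into a single group; `IS NOT NULL` drops that group, which lines up with `value_counts()` ignoring missing values by default.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE tracks (Name TEXT, Composer TEXT);
    INSERT INTO tracks VALUES ('Song A', 'Ann'), ('Song B', NULL), ('Song C', NULL);
""")

# Without a filter, all NULL composers form one group (shown as None).
unfiltered = demo.execute("""
    SELECT Composer, COUNT(*) FROM tracks
    GROUP BY Composer
    ORDER BY COUNT(*) DESC
""").fetchall()

# IS NOT NULL removes the missing values before grouping.
filtered = demo.execute("""
    SELECT Composer, COUNT(*) FROM tracks
    WHERE Composer IS NOT NULL
    GROUP BY Composer
""").fetchall()

print(unfiltered)
print(filtered)
```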

Translate it into **lod**:

In [19]:

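# Count tracks per composer, skipping rows with no composer.
d = defaultdict(int)
for x in tracks_lod:
    if x["Composer"] is not None:
        d[x["Composer"]] += 1

# Sort by count, largest first, and keep the top 10.
sorted(d.items(), key=lambda x: x[1], reverse=True)[:10]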
  

[('Steve Harris', 80),
 ('U2', 44),
 ('Jagger/Richards', 35),
 ('Billy Corgan', 31),
 ('Kurt Cobain', 26),
 ('Bill Berry-Peter Buck-Mike Mills-Michael Stipe', 25),
 ('The Tea Party', 24),
 ('Gilberto Gil', 23),
 ('Chico Science', 23),
 ('Miles Davis', 23)]
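
The three outputs for Q3 disagree near the bottom because several composers tie at 23, and each framework breaks ties in its own way. The sketch below, on an invented composer list, shows one way to make a list-of-dictionaries style count deterministic by adding the name as a second sort key; in SQL, a comparable effect would come from `ORDER BY count(*) DESC, Composer`.

```python
from collections import Counter

# Made-up composer column; None marks a missing value.
composers = ["Bo", "Ann", None, "Cy", "Ann", "Cy", "Bo", "Dee"]

counts = Counter(c for c in composers if c is not None)

# most_common() keeps first-seen order among ties.
first_seen_order = counts.most_common(3)

# Sorting on (-count, name) breaks ties alphabetically, so the result
# does not depend on the order of the input rows.
deterministic = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]

print(first_seen_order)
print(deterministic)
```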

### Q4

Given this code:

In [20]:
artists_df.Name[artists_df.Name.str.startswith("A")]

0                                                  AC/DC
1                                                 Accept
2                                              Aerosmith
3                                      Alanis Morissette
4                                        Alice In Chains
5                                   Antônio Carlos Jobim
6                                           Apocalyptica
7                                             Audioslave
25                                               Azymuth
42                                          A Cor Do Som
158                                              Aquaman
160          Aerosmith & Sierra Leone's Refugee Allstars
165                                        Avril Lavigne
196                                            Aisha Duo
201                                       Aaron Goldberg
205               Alberto Turco & Nova Schola Gregoriana
208    Anne-Sophie Mutter, Herbert Von Karajan & Wien...
213    Academy of St. Martin in

Translate it into **SQL** (Note: the SQL solution will count NULL values, while the pandas count does not --- that is fine):

In [21]:
statement = '''
    SELECT Name 
    FROM artists 
    WHERE Name LIKE 'A%'
'''
cur.execute(statement)
[dict(row) for row in cur.fetchall()]

[{'Name': 'AC/DC'},
 {'Name': 'Accept'},
 {'Name': 'Aerosmith'},
 {'Name': 'Alanis Morissette'},
 {'Name': 'Alice In Chains'},
 {'Name': 'Antônio Carlos Jobim'},
 {'Name': 'Apocalyptica'},
 {'Name': 'Audioslave'},
 {'Name': 'Azymuth'},
 {'Name': 'A Cor Do Som'},
 {'Name': 'Aquaman'},
 {'Name': "Aerosmith & Sierra Leone's Refugee Allstars"},
 {'Name': 'Avril Lavigne'},
 {'Name': 'Aisha Duo'},
 {'Name': 'Aaron Goldberg'},
 {'Name': 'Alberto Turco & Nova Schola Gregoriana'},
 {'Name': 'Anne-Sophie Mutter, Herbert Von Karajan & Wiener Philharmoniker'},
 {'Name': 'Academy of St. Martin in the Fields & Sir Neville Marriner'},
 {'Name': 'Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner'},
 {'Name': 'Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair'},
 {'Name': 'Aaron Copland & London Symphony Orchestra'},
 {'Name': 'Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett'},
 {'Name': 'Antal Doráti & London Sym

Translate it into **lod**:

In [22]:
[x["Name"] for x in artists_lod if x["Name"].startswith("A")]


['AC/DC',
 'Accept',
 'Aerosmith',
 'Alanis Morissette',
 'Alice In Chains',
 'Antônio Carlos Jobim',
 'Apocalyptica',
 'Audioslave',
 'Azymuth',
 'A Cor Do Som',
 'Aquaman',
 "Aerosmith & Sierra Leone's Refugee Allstars",
 'Avril Lavigne',
 'Aisha Duo',
 'Aaron Goldberg',
 'Alberto Turco & Nova Schola Gregoriana',
 'Anne-Sophie Mutter, Herbert Von Karajan & Wiener Philharmoniker',
 'Academy of St. Martin in the Fields & Sir Neville Marriner',
 'Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner',
 'Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair',
 'Aaron Copland & London Symphony Orchestra',
 'Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett',
 'Antal Doráti & London Symphony Orchestra',
 'Amy Winehouse',
 'Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart',
 'Adrian Leaper & Doreen de Feis']
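
One subtlety when translating `startswith` to SQL: in SQLite, `LIKE` ignores case for ASCII letters by default, whereas Python's `startswith` and SQLite's `GLOB` are case-sensitive. The results above match, so it makes no difference for this data. A sketch on invented artist names shows where the two could diverge:

```python
import sqlite3

names = ["AC/DC", "a-ha", "Beck"]  # invented rows for illustration

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE artists (Name TEXT)")
demo.executemany("INSERT INTO artists VALUES (?)", [(n,) for n in names])

# LIKE: case-insensitive for ASCII letters by default in SQLite.
like_matches = [n for (n,) in demo.execute(
    "SELECT Name FROM artists WHERE Name LIKE 'A%' ORDER BY rowid")]

# GLOB: case-sensitive, uses * instead of %.
glob_matches = [n for (n,) in demo.execute(
    "SELECT Name FROM artists WHERE Name GLOB 'A*' ORDER BY rowid")]

# Python: str.startswith is case-sensitive.
python_matches = [n for n in names if n.startswith("A")]

print(like_matches, glob_matches, python_matches)
```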

### Q5

Given this code:

In [23]:
artists_df\
    .merge(albums_df, on = "ArtistId")\
    .groupby(["ArtistId", "Name"])\
    ["Name"]\
    .count()\
    .sort_values(ascending = False)\
    .head(10)

ArtistId  Name           
90        Iron Maiden        21
22        Led Zeppelin       14
58        Deep Purple        11
150       U2                 10
50        Metallica          10
114       Ozzy Osbourne       6
118       Pearl Jam           5
82        Faith No More       4
84        Foo Fighters        4
21        Various Artists     4
Name: Name, dtype: int64

Translate it into **SQL**:

In [24]:
statement = '''
        SELECT artists.ArtistId, Name, COUNT(*)
        FROM artists
        JOIN albums ON artists.ArtistId = albums.ArtistId
        GROUP BY artists.ArtistId
        ORDER BY COUNT(*) DESC
        LIMIT 10
'''
cur.execute(statement)
[dict(row) for row in cur.fetchall()]

[{'ArtistId': 90, 'Name': 'Iron Maiden', 'COUNT(*)': 21},
 {'ArtistId': 22, 'Name': 'Led Zeppelin', 'COUNT(*)': 14},
 {'ArtistId': 58, 'Name': 'Deep Purple', 'COUNT(*)': 11},
 {'ArtistId': 50, 'Name': 'Metallica', 'COUNT(*)': 10},
 {'ArtistId': 150, 'Name': 'U2', 'COUNT(*)': 10},
 {'ArtistId': 114, 'Name': 'Ozzy Osbourne', 'COUNT(*)': 6},
 {'ArtistId': 118, 'Name': 'Pearl Jam', 'COUNT(*)': 5},
 {'ArtistId': 21, 'Name': 'Various Artists', 'COUNT(*)': 4},
 {'ArtistId': 82, 'Name': 'Faith No More', 'COUNT(*)': 4},
 {'ArtistId': 84, 'Name': 'Foo Fighters', 'COUNT(*)': 4}]
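
Both `merge` (which defaults to `how="inner"`) and a plain `JOIN` keep only artists that have at least one album. A sketch on invented rows contrasts this with a `LEFT JOIN`, which would also report artists with zero albums:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    INSERT INTO artists VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO albums VALUES (100, 'First', 1), (101, 'Second', 1);
""")

# Inner join: 'Bo' has no albums and disappears from the result.
inner = demo.execute("""
    SELECT artists.Name, COUNT(*)
    FROM artists JOIN albums ON artists.ArtistId = albums.ArtistId
    GROUP BY artists.ArtistId
    ORDER BY artists.ArtistId
""").fetchall()

# COUNT(albums.AlbumId) ignores the NULLs that LEFT JOIN fills in,
# so an artist with no albums is reported with 0 rather than 1.
left = demo.execute("""
    SELECT artists.Name, COUNT(albums.AlbumId)
    FROM artists LEFT JOIN albums ON artists.ArtistId = albums.ArtistId
    GROUP BY artists.ArtistId
    ORDER BY artists.ArtistId
""").fetchall()

print(inner)
print(left)
```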

### BONUS

Translate **Q5** into **lod**:

In [25]:
# Count albums per ArtistId (the join key).
album_counts = defaultdict(int)
for album in albums_lod:
    album_counts[album["ArtistId"]] += 1

# Inner join: keep only artists that have at least one album.
answer = [
    {"ArtistId": a["ArtistId"], "Name": a["Name"], "Count": album_counts[a["ArtistId"]]}
    for a in artists_lod
    if a["ArtistId"] in album_counts
]

# Sort by album count, largest first, and keep the top 10.
answer = sorted(answer, key=lambda r: r["Count"], reverse=True)[:10]

answer


[{'ArtistId': 90, 'Name': 'Iron Maiden', 'Count': 21},
 {'ArtistId': 22, 'Name': 'Led Zeppelin', 'Count': 14},
 {'ArtistId': 58, 'Name': 'Deep Purple', 'Count': 11},
 {'ArtistId': 50, 'Name': 'Metallica', 'Count': 10},
 {'ArtistId': 150, 'Name': 'U2', 'Count': 10},
 {'ArtistId': 114, 'Name': 'Ozzy Osbourne', 'Count': 6},
 {'ArtistId': 118, 'Name': 'Pearl Jam', 'Count': 5},
 {'ArtistId': 21, 'Name': 'Various Artists', 'Count': 4},
 {'ArtistId': 82, 'Name': 'Faith No More', 'Count': 4},
 {'ArtistId': 84, 'Name': 'Foo Fighters', 'Count': 4}]
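
For a more general list-of-dictionaries counterpart to `merge`, one option is to index one table by the join key first. The helper below is a hypothetical sketch on invented rows, not the lab's required solution:

```python
from collections import defaultdict

def inner_join_lod(left, right, key):
    """Inner-join two lists of dictionaries on a shared key column."""
    # Index the right-hand rows by key once, so each left row is matched
    # with a dictionary lookup instead of a scan over every right row.
    index = defaultdict(list)
    for rrow in right:
        index[rrow[key]].append(rrow)
    # On a column-name clash, the left-hand value wins.
    return [
        {**rrow, **lrow}
        for lrow in left
        for rrow in index.get(lrow[key], [])
    ]

artists_demo = [{"ArtistId": 1, "Name": "Ann"}, {"ArtistId": 2, "Name": "Bo"}]
albums_demo = [
    {"AlbumId": 100, "Title": "First", "ArtistId": 1},
    {"AlbumId": 101, "Title": "Second", "ArtistId": 1},
]
joined = inner_join_lod(artists_demo, albums_demo, "ArtistId")
print(joined)
```

Unlike `merge`, which adds `_x` and `_y` suffixes to clashing column names, this sketch simply lets the left-hand value overwrite the right-hand one.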