<a href="https://colab.research.google.com/github/richard-cartwright/personal/blob/master/EtoE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Packages

In [0]:
# Install package for reading Excel files
# Note: xlrd 2.0+ reads only .xls; recent pandas versions read .xlsx via openpyxl instead
!pip install xlrd

In [0]:
# Basic imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import xlrd

# Plots inline
%matplotlib inline

# Setting plotting styles
plt.style.use('fivethirtyeight')
sns.set_style('white')

# Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Environment

By default this notebook runs in Google Colab (a Python environment hosted on a Google server). All my files are stored in Google Drive, so I mount my GDrive into the environment and read and write files there directly.

**This environment can be changed**. If you are instead working on your local machine, just ensure you have all the required files and are pointing to the correct path.

Original data:
- streams per ISRC: ***ISRC_1m.txt***
- Apple Music international availability, i.e. whether each ISRC-UPC pair is available on Apple Music in each country: ***UMI_AppleMusicTracks.txt***
- DiGS Global Rights data: ***2015to18_digs_rights_allstatus.xlsx, 2010to14_digs_rights_allstatus.xlsx, before2010_digs_rights_allstatus.xlsx***
- all UPC-ISRC pairings (4.5 million): ***upcs_isrcs_digs_rights_allstatus.tsv***

Derived data:
- Boolean, long tables of the Rights data (UPCs): ***rights_2015to18_allcountries.csv, rights_2010to14_allcountries.csv, rights_before2010_allcountries.csv***
- Apple Music availability per UPC per country (including the number of ISRCs available per UPC): ***international_UPCs_available.csv***
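
If you switch to a local machine, a quick check that the original files are where the notebook expects them can save a failed run later. The sketch below is only an illustration: `missing_files`, `ORIGINAL_FILES` and the `./Data` folder are hypothetical names, and the file list is copied from the "Original data" list above.

```python
from pathlib import Path

# File names copied from the "Original data" list; adjust if your copies differ.
ORIGINAL_FILES = [
    "ISRC_1m.txt",
    "UMI_AppleMusicTracks.txt",
    "2015to18_digs_rights_allstatus.xlsx",
    "2010to14_digs_rights_allstatus.xlsx",
    "before2010_digs_rights_allstatus.xlsx",
    "upcs_isrcs_digs_rights_allstatus.tsv",
]


def missing_files(data_dir, filenames):
    """Return the names in `filenames` that do not exist inside `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in filenames if not (data_dir / name).is_file()]


# Hypothetical local folder; in Colab this would be the mounted Drive path.
data_dir = Path("./Data")
print("Missing files:", missing_files(data_dir, ORIGINAL_FILES))
```

An empty list means every original file was found in the folder.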

In [0]:
# Mount GDrive to Colab environment

from google.colab import drive
drive.mount('/content/drive')

In [0]:
# This is so I can save files directly to my GDrive

# Install the PyDrive wrapper & import libraries
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
# Note: this rebinds the name `drive`, replacing the google.colab `drive` module imported above
drive = GoogleDrive(gauth)

In [0]:
# Unzip zipped folders, does not need to be repeated

# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/umgglobal_Music.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'
# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/International.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'
# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/upcs_isrcs_digs_rights_allstatus2.tsv.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/formerly EMI and now GLOBAL.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

In [0]:
# View files in folder
!ls '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# Create path for data
path = '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# Data

## Apple tracks data

### Load Apple availability data

In [0]:
# # DON'T NEED TO RERUN

# # All tracks on Apple Music

# # Select just non-wordy columns so can use only these cols for read_csv (saves memory)
# wordy_columns = ['Track_Vendor_id',
#                  'Track_Title ',
#                  'Primary_Artist_Name',
#                  'Track_Audio_Language',
#                  'Playlist_Adam_ID ']
# nonwordy_apple_columns = [col for col in pd.read_table(path+'UMI_AppleMusicTracks.txt',nrows=5).columns if col not in wordy_columns]

# # ---------------------------
# # INTERNATIONAL: Read in Apple International ISRCs info (nonwordy columns)
# international_apple_df = pd.read_table(path+'UMI_AppleMusicTracks.txt',
#                                        usecols=nonwordy_apple_columns,
#                                        dtype={'Track_adam_id':str,
#                                               'Playlist_UPC ':str, #key needs the trailing space to match the raw column name
#                                               'int':'bool'},
#                                        low_memory=False)
# international_apple_df.info()
# # Converting columns into boolean (saves space)
# for country in [col for col in international_apple_df.columns if len(col)==2]:
#     international_apple_df[country] = international_apple_df[country].astype('bool')
# international_apple_df.info()

# # ---------------------------
# # GLOBAL: Read in Apple Global ISRCs info (nonwordy columns)
# global_international_apple_df = pd.read_table(path+'umgglobal_AppleMusicTracks.txt',
#                                               usecols=nonwordy_apple_columns,
#                                               dtype={'Track_adam_id':str,
#                                                      'Playlist_UPC ':str, #key needs the trailing space to match the raw column name
#                                                      'int':'bool'},
#                                               low_memory=False)
# global_international_apple_df.info()
# # Converting columns into boolean (saves space)
# for country in [col for col in global_international_apple_df.columns if len(col)==2]:
#     global_international_apple_df[country] = global_international_apple_df[country].astype('bool')
# global_international_apple_df.info()

# # ---------------------------
# # Combine International & Global
# international_apple_df = pd.concat(
#     [international_apple_df,
#      global_international_apple_df],
#     ignore_index=True)
# del global_international_apple_df #saves RAM
# international_apple_df.columns = [col.strip() for col in international_apple_df.columns]
# international_apple_df.info()

# # ---------------------------
# # Fillna UPCs & ISRCs with 'unknown'
# international_apple_df['Track_isrc'] = international_apple_df['Track_isrc'].fillna('unknown')
# international_apple_df['Playlist_UPC'] = international_apple_df['Playlist_UPC'].fillna('unknown')

# # ---------------------------
# # Ensure only one row for each UPC-ISRC pair

# # Split into two to avoid running out of RAM
# UPCs_list = sorted(international_apple_df['Playlist_UPC'].unique())
# cutoff = round(len(UPCs_list)/2)
# first_half_UPCs = UPCs_list[:cutoff]
# second_half_UPCs = UPCs_list[cutoff:]

# # Groupby UPC-ISRC, concat the two halves
# international_apple_df = pd.concat(
#     [international_apple_df[international_apple_df['Playlist_UPC'].isin(first_half_UPCs)].groupby(['Playlist_UPC','Track_isrc']).sum().astype('bool').reset_index(),
#      international_apple_df[international_apple_df['Playlist_UPC'].isin(second_half_UPCs)].groupby(['Playlist_UPC','Track_isrc']).sum().astype('bool').reset_index()])

# international_apple_df.head(2)
# international_apple_df.info()
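
The cell above casts the two-letter country columns to boolean to save memory. A minimal sketch of that effect on invented toy data (the frame `flags` and its values are assumptions for illustration, not the Apple table):

```python
import numpy as np
import pandas as pd

# Toy frame: two country columns stored as 64-bit integers.
flags = pd.DataFrame({
    "GB": np.ones(1000, dtype="int64"),
    "FR": np.zeros(1000, dtype="int64"),
})

# Same selection rule as the notebook: country columns have two-letter names.
country_cols = [col for col in flags.columns if len(col) == 2]
as_bool = flags[country_cols].astype(bool)

bytes_int = flags.memory_usage(index=False).sum()
bytes_bool = as_bool.memory_usage(index=False).sum()
print(bytes_int, bytes_bool)
```

A NumPy boolean takes one byte per value against eight for `int64`, which matters with one column per country over millions of rows.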

### Create reference tables for main Apple work 

- ***upcs_isrcs_ref_df***: UPC-ISRC pairs which are not present in the Apple data but whose UPCs are present in the final Rights data
- ***UPC_count_sumstreams_df***: num_ISRCs_per_UPC & global_streams_ofISRCs_onUPC for each UPC
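
The first table is an anti-join: pairs known to Universal that have no match in the Apple data. The notebook does this with a flag column and `isnull()`; the sketch below shows an equivalent route through the `indicator` option of `merge`, on invented toy values (`universal_pairs` and `apple_pairs` are hypothetical stand-ins for the real tables).

```python
import pandas as pd

# Toy stand-ins for the real tables (values invented for illustration).
universal_pairs = pd.DataFrame({
    "UPC": ["0001", "0001", "0002"],
    "ISRC": ["AAA", "BBB", "CCC"],
})
apple_pairs = pd.DataFrame({
    "Playlist_UPC": ["0001"],
    "Track_isrc": ["AAA"],
})

# indicator=True adds a `_merge` column recording where each row was found.
merged = universal_pairs.merge(
    apple_pairs,
    how="left",
    left_on=["UPC", "ISRC"],
    right_on=["Playlist_UPC", "Track_isrc"],
    indicator=True,
)

# Keep only the pairs with no match in the Apple table.
not_in_apple = (
    merged.loc[merged["_merge"] == "left_only", ["UPC", "ISRC"]]
    .reset_index(drop=True)
)
print(not_in_apple)
```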

In [0]:
# # DON'T NEED TO RERUN

# # Create list of UPC-ISRC pairs that are not present in the Apple data but where the UPCs are present in the rights data
# # This is used later to calculate an accurate num_ISRCs_available for each UPC

# # NOTE: dtype={'UPC': str} is VERY important as otherwise pandas reads in the UPCs as integers without the leading zeros

# #---------------------
# # Extract final UPCs which are represented in the final Rights data
# UPC_allyears_allcountries = pd.concat(
#     [pd.read_csv(path+'rights_2015to18_allcountries.csv',
#                  dtype={'UPC': str},usecols=['UPC']),
#      pd.read_csv(path+'rights_2010to14_allcountries.csv',
#                  dtype={'UPC': str},usecols=['UPC']),
#      pd.read_csv(path+'rights_before2010_allcountries.csv',
#                  dtype={'UPC': str},usecols=['UPC'])],
#     ignore_index=True)

# # This ensures no duplicate UPCs
# UPC_allyears_allcountries = UPC_allyears_allcountries.groupby('UPC').size().reset_index().drop(columns=[0])
# UPC_allyears_allcountries.head(2)
# UPC_allyears_allcountries.info()

# #---------------------
# # Apple ISRC-UPC pairs
# apple_isrc_upc_pairs = pd.concat(
#     [pd.read_table(path+'UMI_AppleMusicTracks.txt',
#                   usecols=['Track_isrc','Playlist_UPC '], #the space after _UPC is important
#                   dtype={'Playlist_UPC ':str}, #same trailing space, otherwise the dtype is not applied
#                   low_memory=False),
#      pd.read_table(path+'umgglobal_AppleMusicTracks.txt',
#                   usecols=['Track_isrc','Playlist_UPC '], #the space after _UPC is important
#                   dtype={'Playlist_UPC ':str}, #same trailing space, otherwise the dtype is not applied
#                   low_memory=False)],ignore_index=True)
# apple_isrc_upc_pairs.info()

# # Rid the column names of spaces before or after the colname
# apple_isrc_upc_pairs.columns = [col.strip() for col in apple_isrc_upc_pairs.columns]

# # Ensures no duplicate UPC-ISRC pairs
# apple_isrc_upc_pairs = apple_isrc_upc_pairs.groupby(['Playlist_UPC','Track_isrc'],as_index=False).size().reset_index().drop(columns=[0])

# # Give flag to these UPC-ISRC pairs as this will be used next to deselect these existing pairs
# apple_isrc_upc_pairs['apple_flag'] = True 
# apple_isrc_upc_pairs.info()
# apple_isrc_upc_pairs.head(2)

# #---------------------
# # ISRC-UPC pairs base data from Universal
# upcs_isrcs_ref_df = pd.read_table(path+'upcs_isrcs_digs_rights_allstatus.tsv',
#                                   dtype={0:'str'},
#                                   names=['UPC','ISRC'])
# upcs_isrcs_ref_df.info()

# # Select only UPCs which are in the final Rights data
# upcs_isrcs_ref_df = pd.merge(upcs_isrcs_ref_df,
#                              UPC_allyears_allcountries,
#                              how='right',
#                              on='UPC')
# upcs_isrcs_ref_df.info()
# del UPC_allyears_allcountries #saves space

# # Keep only UPC-ISRC pairs which are not already in the Apple data
# upcs_isrcs_ref_df = pd.merge(upcs_isrcs_ref_df,
#                              apple_isrc_upc_pairs,
#                              how='left',
#                              left_on=['UPC','ISRC'],
#                              right_on=['Playlist_UPC','Track_isrc'])
# upcs_isrcs_ref_df = upcs_isrcs_ref_df[upcs_isrcs_ref_df['apple_flag'].isnull()]
# upcs_isrcs_ref_df.info()
# del apple_isrc_upc_pairs #saves space

# # Select only the full UPC-ISRC columns, dropna covers the fewer ISRCs which are null
# upcs_isrcs_ref_df = upcs_isrcs_ref_df[['UPC','ISRC']].dropna()

# upcs_isrcs_ref_df.head(2)
# upcs_isrcs_ref_df.info()

In [0]:
# # DON'T NEED TO RERUN

# # Create table of global_streams_ofISRCs_onUPC as proxy for popularity

# # Number of streams per ISRC across all platforms - proxy for popularity
# isrc_streams_df = pd.read_csv(path+'ISRC_1m.txt')
# isrc_streams_df.head(2)
# isrc_streams_df.info()

# #---------------------
# # UPC-ISRC pairs base data from Universal
# UPC_count_sumstreams_df = pd.read_table(path+'upcs_isrcs_digs_rights_allstatus.tsv',
#                                         dtype={0:'str'},
#                                         names=['UPC','ISRC'])
# UPC_count_sumstreams_df.info()

# #---------------------
# # Merge ISRC streams onto UPC-ISRCs table, fill total_streams=0 if no matching ISRC available
# # Rows are kept even when the ISRC has no streams, so they still count towards num_ISRCs_per_UPC
# UPC_count_sumstreams_df = pd.merge(UPC_count_sumstreams_df,
#                                    isrc_streams_df,
#                                    how='left',
#                                    left_on='ISRC',
#                                    right_on='isrc').drop(columns=['isrc']).fillna(0)
# del isrc_streams_df #saves space

# #---------------------
# # Groupby UPC to get num_ISRCs_per_UPC and global_streams_ofISRCs_onUPC
# UPC_count_sumstreams_df = UPC_count_sumstreams_df.groupby('UPC').agg({'UPC':'count',
#                                                                       'total_streams':'sum'})
# UPC_count_sumstreams_df.rename(columns={'UPC':'num_ISRCs_per_UPC',
#                                         'total_streams':'global_streams_ofISRCs_onUPC'},
#                                inplace=True)

# UPC_count_sumstreams_df.head(2)
# UPC_count_sumstreams_df.info()
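
The count-and-sum step above can also be written with named aggregation, which produces the final column names in one call. This is a sketch on invented toy data, not the notebook's own code; only the column names (`UPC`, `ISRC`, `isrc`, `total_streams`) are taken from the cell above.

```python
import pandas as pd

# Toy UPC-ISRC pairs and per-ISRC stream counts (values invented).
pairs = pd.DataFrame({
    "UPC": ["0001", "0001", "0002"],
    "ISRC": ["AAA", "BBB", "CCC"],
})
streams = pd.DataFrame({"isrc": ["AAA", "CCC"], "total_streams": [100, 7]})

per_upc = (
    pairs.merge(streams, how="left", left_on="ISRC", right_on="isrc")
    .fillna({"total_streams": 0})  # ISRCs without stream data count as 0 streams
    .groupby("UPC")
    .agg(
        num_ISRCs_per_UPC=("ISRC", "size"),
        global_streams_ofISRCs_onUPC=("total_streams", "sum"),
    )
)
print(per_upc)
```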

### Create UPC-level table from ISRC-level Apple availability table

In [0]:
# # DON'T NEED TO RERUN

# # 1) Groupby ISRC to get availability for each ISRC (across all UPCs) for each country
# # 2) Merge that with the UPC-ISRC pairs which are not in Apple Music - groupby UPC straight away to save space
# # 3) Merge the ISRC availability table with the original Apple table

# # 1) Groupby ISRC to get a bool ISRC availability for each ISRC for each country (True if the ISRC is available with any UPC)
# international_apple_ISRCs_df = international_apple_df.groupby('Track_isrc').sum().astype('bool')
# international_apple_ISRCs_df.columns = ['ISRC_available_in{}'.format(country) for country in international_apple_ISRCs_df.columns]
# international_apple_ISRCs_df.info()

# # #---------------------
# # 2) For the UPC-ISRC pairs which are not in Apple Music, merge on whether the ISRC is available in each country
# # Groupby UPC straight away to save space. The columns now equate to how many ISRCs are available for each UPC for each country (for the UPC-ISRC pairs not in Apple Music)
# upcs_isrcs_ref_df = pd.merge(upcs_isrcs_ref_df,
#                              international_apple_ISRCs_df,
#                              how='left',
#                              left_on='ISRC',
#                              right_index=True).dropna().groupby('UPC').sum()
# upcs_isrcs_ref_df.info()

# # Drop ISRC column as no longer important
# upcs_isrcs_ref_df.drop(columns=['ISRC'],inplace=True)

# # Rename this column so it concats cleanly below
# upcs_isrcs_ref_df = upcs_isrcs_ref_df.reset_index().rename(columns={'UPC':'Playlist_UPC'}).set_index('Playlist_UPC')
# upcs_isrcs_ref_df.info()
# upcs_isrcs_ref_df.head(2)

# # #---------------------
# # 3) Merge original Apple ISRC table with whether that ISRC is available for any UPC
# international_apple_ISRCs_df = pd.merge(international_apple_df,
#                                         international_apple_ISRCs_df,
#                                         how='left',
#                                         left_on='Track_isrc',
#                                         right_index=True)
# international_apple_ISRCs_df.head(2)
# international_apple_ISRCs_df.info()
# del international_apple_df #saves space

In [0]:
# # DON'T NEED TO RERUN

# # 1) Groupby UPC - for each UPC for each country: i) a boolean indicator of whether the UPC is available; ii) an integer of how many ISRCs are available for that UPC
# # 2) For each country: rename columns, make UPC_available_in columns boolean
# # 3) Concat the Apple-derived and non-Apple-derived tables
# # 4) Groupby UPC to solve the multi-UPC issue
# # 5) Merge in num_ISRCs_per_UPC & global_streams_ofISRCs_onUPC

# # #---------------------
# # 1) Groupby UPC - for each UPC for each country: i) a boolean indicator of whether the UPC is available; ii) an integer of how many ISRCs are available for that UPC
# international_UPCs_available_df = international_apple_ISRCs_df.groupby('Playlist_UPC').sum()
# international_UPCs_available_df.info()
# del international_apple_ISRCs_df #saves space

# # #---------------------
# # 2) For each country: rename columns, make UPC_available_in columns boolean
# for country in [col for col in international_UPCs_available_df.columns if len(col)==2]:
#     international_UPCs_available_df.rename(columns={country:'UPC_available_in{}'.format(country),
#                                                     'ISRC_available_in{}'.format(country):'num_ISRCs_available_{}'.format(country)},
#                                            inplace=True)
#     # Make bool columns - if ISRC present for any UPC, then col = True
#     international_UPCs_available_df['UPC_available_in{}'.format(country)] = international_UPCs_available_df['UPC_available_in{}'.format(country)].astype('bool')
    
#     upcs_isrcs_ref_df.rename(columns={'ISRC_available_in{}'.format(country):'num_ISRCs_available_{}'.format(country)},
#                              inplace=True)
# upcs_isrcs_ref_df.head(2)

# # #---------------------
# # international_UPCs_available_df: Apple-derived; for each UPC for each country, a bool UPC_available & an integer num_ISRCs_available_
# # upcs_isrcs_ref_df: non-Apple-derived; for each UPC for each country, an integer num_ISRCs_available_ (no bool UPC_available as none of the UPC-ISRC pairs are available by design)

# # 3) Concat the tables over each other, matching the num_ISRCs_available_ columns, and filling as false the UPC_available columns
# # There are repeated UPCs, which is fixed later by the groupby
# international_UPCs_available_df = pd.concat(
#     [international_UPCs_available_df.reset_index(),
#      upcs_isrcs_ref_df.reset_index()],
#     ignore_index=True).fillna(False)
# del upcs_isrcs_ref_df #saves space
# international_UPCs_available_df.head(2)
# international_UPCs_available_df.info()

# # Converts the num_ISRCs_available_ columns into integers so they can be summed in the groupby
# for country in [col[-2:] for col in international_UPCs_available_df.columns if 'ISRCs' in col]:
#     international_UPCs_available_df['num_ISRCs_available_{}'.format(country)] = international_UPCs_available_df['num_ISRCs_available_{}'.format(country)].astype('int')
# international_UPCs_available_df.info()

# # #---------------------
# # 4) This solves the multi-UPC issue- groupby UPC & sum
# # All UPC_available columns are False for upcs_isrcs_ref_df UPCs, so these UPCs are unaffected.
# # The num_ISRCs_available_ columns sum to the true ISRC availability per UPC, even for UPCs not represented in the Apple table
# international_UPCs_available_df = international_UPCs_available_df.groupby('Playlist_UPC').sum()
# international_UPCs_available_df.info()
# international_UPCs_available_df.head(2)

# # #---------------------
# # 5) Merge in num_ISRCs_per_UPC & global_streams_ofISRCs_onUPC
# international_UPCs_available_df = pd.merge(international_UPCs_available_df,
#                                            UPC_count_sumstreams_df,
#                                            how='left',
#                                            left_index=True,
#                                            right_index=True)
# del UPC_count_sumstreams_df #saves space
# international_UPCs_available_df.head(2)
# international_UPCs_available_df.info()
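
The two cells above move from ISRC-level rows to one row per UPC. The sketch below replays the core idea on a tiny invented table with two countries; it uses `any()` instead of `sum().astype('bool')` and leaves out the non-Apple pairs and the stream counts, so treat it as an illustration of the logic rather than a replacement.

```python
import pandas as pd

# Toy ISRC-level table: one row per UPC-ISRC pair, one boolean column per country.
apple = pd.DataFrame({
    "Playlist_UPC": ["0001", "0001", "0002"],
    "Track_isrc": ["AAA", "BBB", "AAA"],
    "GB": [True, False, False],
    "FR": [False, False, True],
})
countries = ["GB", "FR"]

# 1) An ISRC counts as available in a country if it is available on any UPC there.
isrc_available = apple.groupby("Track_isrc")[countries].any()

# 2) Attach that ISRC-level availability to every UPC-ISRC pair.
pairs = apple[["Playlist_UPC", "Track_isrc"]].merge(
    isrc_available, how="left", left_on="Track_isrc", right_index=True
)

# 3) Per UPC: is the UPC itself available, and how many of its ISRCs
#    are available (on any UPC) in each country?
upc_available = (
    apple.groupby("Playlist_UPC")[countries].any().add_prefix("UPC_available_in")
)
num_isrcs = (
    pairs.groupby("Playlist_UPC")[countries].sum().add_prefix("num_ISRCs_available_")
)
upc_level = upc_available.join(num_isrcs)
print(upc_level)
```

In this toy data UPC 0001 is not available in FR, yet one of its ISRCs is available there through UPC 0002, which is the gap the `num_ISRCs_available_` columns are meant to expose.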

In [0]:
# # DON'T NEED TO RERUN

# # Create & upload file to Google Drive - international_UPCs_df

# international_UPCs_available_df.to_csv('international_UPCs_available.csv', index=True)
# uploaded = drive.CreateFile({'title': 'international_UPCs_available.csv'})
# uploaded.SetContentFile('international_UPCs_available.csv')
# uploaded.Upload()
# print('Uploaded file with ID {}'.format(uploaded.get('id')))

### Reupload Apple reference table

In [0]:
# Reload availability table
# The dtype key must name the column as it appears in the CSV (Playlist_UPC), otherwise UPCs lose their leading zeros
international_UPCs_available_df = pd.read_csv(path+'international_UPCs_available (1).csv',
                                              dtype={'Playlist_UPC':'str'},
                                              low_memory=False
                                             ).set_index('Playlist_UPC')

# international_UPCs_available_df = pd.read_csv(path+'international_UPCs_available.csv',
#                                               dtype={'Playlist_UPC':'str'},
#                                               low_memory=False
#                                              ).set_index('Playlist_UPC')
international_UPCs_available_df.head(2)
international_UPCs_available_df.info()
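
UPCs can begin with zeros, so they have to be read as text, and the key in the `dtype` mapping has to name the column as it appears in the CSV header. A minimal sketch with an in-memory CSV (the UPC value is invented):

```python
import io

import pandas as pd

csv_text = "Playlist_UPC,num_ISRCs_per_UPC\n00602537,12\n"

# Default parsing infers an integer column and drops the leading zeros.
as_default = pd.read_csv(io.StringIO(csv_text))

# Naming the column in `dtype` keeps the UPC exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Playlist_UPC": str})

print(as_default["Playlist_UPC"].iloc[0])
print(as_text["Playlist_UPC"].iloc[0])
```

This matters for the merges further down, because a UPC parsed as an integer will not match the same UPC stored as a string in the Rights tables.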

## Rights data

### Create wide Rights data of boolean columns

Only to be run once. Takes ~4hrs.
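
Much of that runtime is likely spent in the row-wise `apply` calls, one per country per right. A possible faster route, which this notebook did not use and which is untested at full data size, is `Series.str.get_dummies`, which builds every indicator column for one right in a single call. The sketch below runs on invented toy data; the column name `Legal Territories List` and the `-` placeholder are taken from the cell below.

```python
import pandas as pd

# Toy Rights table: one comma-separated string of country codes per UPC.
rights = pd.DataFrame(
    {"Legal Territories List": ["GB,FR", "FR", "-"]},
    index=pd.Index(["0001", "0002", "0003"], name="UPC"),
)

legal = (
    rights["Legal Territories List"]
    .replace("-", "none")      # same placeholder handling as the notebook
    .str.get_dummies(sep=",")  # one 0/1 column per distinct code
    .astype(bool)
    .add_suffix("_legal")
)
print(legal)
```

The same call would be repeated for the Marketing and Opt In columns with a different suffix. On the full tables it may still need to be applied per year range to stay within memory.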

In [0]:
# # DOES NOT NEED TO BE RERUN

# # This code takes the Rights data for all UPCs
# # At the beginning there is, for each UPC, a string of two-letter country codes with rights for each of Legal / Marketing / Optin
# # This script creates boolean columns for each country for each right, of whether that UPC has that right

# # This code takes a long time to run, ~4hrs for the entire for loop
# # It is split into several for loops in order to stay below the RAM limit (11GB)
# # This only needs to be done ONCE. It saves each year_range output to GDrive, which are then reuploaded in the next section.

# # Within the For loop:
# # 1) Read in the Rights data
# # 2) Keep only albums & singles (not ringtones & videos)
# # 3) Split string by comma into a list of all countries
# # 4) Loop to create boolean columns of whether the UPC has the right
# # 5) Create & upload file to Google Drive

# # ----------------
# # Separated into 3 manageable sections due to large table size
# rights_df_dict = {}
# year_ranges = ['2015to18','2010to14','before2010']
# for year in year_ranges:

#     # 1) Legal & marketing rights, and opt-in, for each UPC
#     rights_df_dict[year] = pd.read_excel(path+'{}_digs_rights_allstatus.xlsx'.format(year),
#                               dtype={'UPC': str}
#                              ).set_index('UPC')
    
#     # Strip leading & trailing spaces from column names
#     rights_df_dict[year].columns = [col.strip() for col in rights_df_dict[year].columns]
#     rights_df_dict[year].info()
    
#     # ----------------
#     # Lower case to ensure no issues with case non-matching
#     rights_df_dict[year]['Product Configuration'] = rights_df_dict[year]['Product Configuration'].apply(lambda s: s.lower())
    
#     # 2) Keep only albums & singles (not ringtones & videos)
#     rights_df_dict[year]['single'] = rights_df_dict[year]['Product Configuration'].apply(lambda x: 
#                                                                    True if 'single' in x.lower() 
#                                                                    and 'video' not in x.lower()
#                                                                    and 'album' not in x.lower()
#                                                                    else False)
#     rights_df_dict[year]['album'] = rights_df_dict[year]['Product Configuration'].apply(lambda x: 
#                                                                   True if ('album' in x.lower() 
#                                                                            or 'bundle' in x.lower())
#                                                                   and 'video' not in x.lower()
#                                                                   and 'single' not in x.lower()
#                                                                   else False)
#     rights_df_dict[year] = rights_df_dict[year][rights_df_dict[year]['single'] | rights_df_dict[year]['album']]
#     rights_df_dict[year].info()
    
#     # ----------------
#     # Keep only relevant columns
#     rights_df_dict[year] = rights_df_dict[year][['Legal Territories List','Marketing Territories List','Opt In List']]

#     # 3) Making each string of countries a list within each cell for each UPC
#     rights_df_dict[year].replace('-', 'none', inplace=True)
#     for col in rights_df_dict[year].columns:
#         rights_df_dict[year][col] = rights_df_dict[year][col].apply(lambda x: x.split(','))
#     rights_df_dict[year].head(2)
#     rights_df_dict[year].info()
    
#     # ----------------
#     # Creates lists of all countries, and country_right interactions
#     list_of_countries = list(rights_df_dict[year]['Legal Territories List'][rights_df_dict[year]['Legal Territories List'].apply(lambda x: len(x)).idxmax()])
#     print(list_of_countries)
    
    
#     # To save memory: split into three separate for loops so I can drop the reference col each time
#     # 4) Loop to create boolean columns of whether the UPC has the right
    
#     # Legal rights
#     for country in list_of_countries+['none']:
#         print(country, '{} legal starting'.format(year))
#         rights_df_dict[year]['{}_legal'.format(country)] = rights_df_dict[year].apply(lambda row: 
#                                                                                       True if country in row['Legal Territories List'] 
#                                                                                       else False,
#                                                                                       axis=1)
#         print(country, '{} legal ended'.format(year))   
#     rights_df_dict[year].drop(columns=['Legal Territories List'],
#                    inplace=True)

#     # Marketing rights
#     for country in list_of_countries+['none']:
#         print(country, '{} marketing starting'.format(year))
#         rights_df_dict[year]['{}_marketing'.format(country)] = rights_df_dict[year].apply(lambda row: 
#                                                                                           True if country in row['Marketing Territories List'] 
#                                                                                           else False,
#                                                                                           axis=1)
#         print(country, '{} marketing ended'.format(year))
#     rights_df_dict[year].drop(columns=['Marketing Territories List'],
#                    inplace=True)

#     # Opt-in
#     for country in list_of_countries+['none']:
#         print(country, '{} optin starting'.format(year))
#         rights_df_dict[year]['{}_optin'.format(country)] = rights_df_dict[year].apply(lambda row: 
#                                                                                       True if country in row['Opt In List'] 
#                                                                                       else False,
#                                                                                       axis=1)
#         print(country, '{} optin ended'.format(year))
#     rights_df_dict[year].drop(columns=['Opt In List'],
#                    inplace=True)
    
#     rights_df_dict[year].info()
    
#     # ----------------
#     # 5) Create & upload file to Google Drive
#     rights_df_dict[year].to_csv('rights_{}_allcountries.csv'.format(year), index=True)
#     uploaded = drive.CreateFile({'title': 'rights_{}_allcountries.csv'.format(year)})
#     uploaded.SetContentFile('rights_{}_allcountries.csv'.format(year))
#     uploaded.Upload()
#     print('Uploaded file with ID {}'.format(uploaded.get('id')))
    
#     del rights_df_dict[year] #saves RAM

### Reupload wide Rights data of boolean columns

In [0]:
# Reupload & concat all Rights information

# Some year_ranges have columns which are not present for all year_ranges, e.g. if a country code has changed. These columns are filled with False
# Note: fill with the boolean False, not the string 'False' (any non-empty string casts to True)
reuploaded_rights_all_countries_df = pd.concat(
    [pd.read_csv(path+'rights_2015to18_allcountries.csv',
                 dtype={'UPC': str}),
     pd.read_csv(path+'rights_2010to14_allcountries.csv',
                 dtype={'UPC': str}),
     pd.read_csv(path+'rights_before2010_allcountries.csv',
                 dtype={'UPC': str})],
    ignore_index=True).set_index('UPC').fillna(False).astype('bool').reset_index().dropna() #drops NaN UPCs
reuploaded_rights_all_countries_df.head(2)
reuploaded_rights_all_countries_df.info()

# ---------------------------
# Ensure only one row for each UPC

# Split into two to avoid running out of RAM
UPCs_list = sorted(reuploaded_rights_all_countries_df['UPC'].unique())
cutoff = round(len(UPCs_list)/2)
first_half_UPCs = UPCs_list[:cutoff]
second_half_UPCs = UPCs_list[cutoff:]

# Groupby UPC, concat the two halves
reuploaded_rights_all_countries_df = pd.concat(
    [reuploaded_rights_all_countries_df[reuploaded_rights_all_countries_df['UPC'].isin(first_half_UPCs)].groupby('UPC').sum().astype('bool'),
     reuploaded_rights_all_countries_df[reuploaded_rights_all_countries_df['UPC'].isin(second_half_UPCs)].groupby('UPC').sum().astype('bool')])

reuploaded_rights_all_countries_df.info()
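
The de-duplication above keeps a right as True if any of the duplicate rows for a UPC holds it. The same rule can be expressed directly with `groupby(...).any()`, shown here as a sketch on invented toy data; whether it removes the need to split the table in two at full size is untested.

```python
import pandas as pd

# Toy Rights table with a duplicated UPC (values invented).
rights = pd.DataFrame({
    "UPC": ["0001", "0001", "0002"],
    "GB_legal": [True, False, False],
    "FR_legal": [False, False, True],
})

# any() is True when at least one duplicate row holds the right,
# which gives the same result as sum().astype('bool') on boolean columns.
one_row_per_upc = rights.groupby("UPC").any()
print(one_row_per_upc)
```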

### Rights wordy info

Create a table with all wordy information (Artist, Title, Label, etc.) for each UPC, taken from the original Rights spreadsheets.

In [0]:
# Creates table of wordyinfo for each UPC: Artist / Title / ReleaseDate / Label etc

# Legal & marketing rights, and opt-in, for each UPC
# Duplicate UPCs exist; they are removed further down, after filtering to Singles & Albums, because de-duplicating the full table first takes too long
rights_wordyinfo_df = pd.concat(
    [pd.read_excel(path+'2015to18_digs_rights_allstatus.xlsx',
                   dtype={'UPC': str}),
     pd.read_excel(path+'2010to14_digs_rights_allstatus.xlsx',
                   dtype={'UPC': str}),
     pd.read_excel(path+'before2010_digs_rights_allstatus.xlsx',
                   dtype={'UPC': str})],
    ignore_index=True)
rights_wordyinfo_df.columns = [col.strip() for col in rights_wordyinfo_df.columns]
rights_wordyinfo_df.drop(columns=['Opt In List','Marketing Territories List','Legal Territories List'], #these are the non-wordy columns
                         inplace=True)
rights_wordyinfo_df.info()

# ------------------------
# Make everything lower case to avoid case mismatch
rights_wordyinfo_df = rights_wordyinfo_df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

# Create boolean columns for each of Single and Album
rights_wordyinfo_df['single'] = rights_wordyinfo_df['Product Configuration'].apply(lambda x: 
                                                                                   True if 'single' in x.lower() 
                                                                                   and 'video' not in x.lower()
                                                                                   and 'album' not in x.lower()
                                                                                   else False)
rights_wordyinfo_df['album'] = rights_wordyinfo_df['Product Configuration'].apply(lambda x: 
                                                                                  True if ('album' in x.lower() 
                                                                                           or 'bundle' in x.lower())
                                                                                  and 'video' not in x.lower()
                                                                                  and 'single' not in x.lower()
                                                                                  else False)
# Keep only Singles & Albums (not ringtones or videos)
rights_wordyinfo_df = rights_wordyinfo_df[rights_wordyinfo_df['single'] | rights_wordyinfo_df['album']]
rights_wordyinfo_df.info()

# Remove duplicate UPCs (.first() takes an annoying amount of time)
rights_wordyinfo_df = rights_wordyinfo_df.groupby('UPC').first()
rights_wordyinfo_df.info()

# ------------------------
# Fill the few empty Artist & Titles with 'unknown'
rights_wordyinfo_df['Artist'] = rights_wordyinfo_df['Artist'].fillna('unknown')
rights_wordyinfo_df['Title'] = rights_wordyinfo_df['Title'].fillna('unknown')

# Create boolean flag of released before 2018
rights_wordyinfo_df['release_before_2018'] = rights_wordyinfo_df['Release Date'].apply(lambda x: 
                                                                                       True if x.year < 2018 
                                                                                       else False)

# Create binary column from Status, 'existing' or 'voided' (voided=cancelled,deleted etc)
rights_wordyinfo_df['binary_status'] = rights_wordyinfo_df['Status'].apply(lambda x: 
                                                                           'existing' if x in ['delivered','scheduled','final'] 
                                                                           else 'voided')

rights_wordyinfo_df.info()
rights_wordyinfo_df.head(2)

# ------------------------
# Add a boolean column flagging UPCs whose Artist-Title pair appears more than once
artist_title_duplicates_df = pd.DataFrame(
    rights_wordyinfo_df.groupby(['Artist','Title']).size()
).rename(columns={0:'duplicate_artist_title'})
artist_title_duplicates_df['duplicate_artist_title'] = artist_title_duplicates_df['duplicate_artist_title'] > 1

rights_wordyinfo_df = pd.merge(rights_wordyinfo_df.reset_index(),
                               artist_title_duplicates_df.reset_index(),
                               how='left',
                               on=['Artist','Title']
                              ).set_index('UPC')
rights_wordyinfo_df.info()
del artist_title_duplicates_df #saves RAM

# ------------------------
# Create core_columns for per-country table
core_columns = list(rights_wordyinfo_df.columns)
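
The duplicate Artist-Title flag in the cell above is built with a groupby, a size count and a merge. As a minimal sketch, assuming Artist and Title contain no missing values (the cell fills them with 'unknown' first), `DataFrame.duplicated` with `keep=False` should give the same flag in one step. `toy_df` and its values are made up for illustration and are not part of the Rights data.

```python
import pandas as pd

# Toy stand-in for rights_wordyinfo_df (hypothetical values)
toy_df = pd.DataFrame(
    {'Artist': ['A', 'A', 'B', 'unknown'],
     'Title': ['Song', 'Song', 'Song', 'unknown']},
    index=pd.Index(['001', '002', '003', '004'], name='UPC'))

# keep=False marks every row of a repeated (Artist, Title) pair as True,
# not just the second and later occurrences
toy_df['duplicate_artist_title'] = toy_df.duplicated(
    subset=['Artist', 'Title'], keep=False)
```

This keeps the UPC index intact, so no `reset_index()` / `set_index('UPC')` round trip is needed.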

# Base UPC table for all countries

In [0]:
# Using the reuploaded Rights data as the base, merge on the Apple availability information and the Rights wordy info
# This creates the large reference table covering all countries

# Merge Rights data & derived Apple availability data, for each UPC for each country
international_UPCs_df = pd.merge(reuploaded_rights_all_countries_df.reset_index(),
                                 international_UPCs_available_df,
                                 how='left',
                                 left_on='UPC',
                                 right_index=True).set_index('UPC')
international_UPCs_df.info()
international_UPCs_df.head(2)
del reuploaded_rights_all_countries_df #save space
del international_UPCs_available_df #save space

# ------------------------
# Merge wordy info
international_UPCs_df = pd.merge(international_UPCs_df,
                                 rights_wordyinfo_df,
                                 how='left',
                                 left_index=True,
                                 right_index=True)
international_UPCs_df.info()
international_UPCs_df.head(2)
del rights_wordyinfo_df #saves RAM

# ------------------------
# Create list of all country codes from the Apple data (the two-letter column names)
international_columns_list = [col for col in pd.read_csv(path+'UMI_AppleMusicTracks.txt',sep='\t',nrows=5).columns if len(col)==2]

# Create, per country, the proportion of each UPC's ISRCs already on the store
for country in international_columns_list:
    international_UPCs_df['propISRCs_onstore_{}'.format(country)] = international_UPCs_df['num_ISRCs_available_{}'.format(country)] / international_UPCs_df['Track Count']
    
# ------------------------
# Fill nans
international_UPCs_df['global_streams_ofISRCs_onUPC'] = international_UPCs_df['global_streams_ofISRCs_onUPC'].fillna(0)
for country in international_columns_list:
    international_UPCs_df['num_ISRCs_available_{}'.format(country)] = international_UPCs_df['num_ISRCs_available_{}'.format(country)].fillna(0)
    international_UPCs_df['propISRCs_onstore_{}'.format(country)] = international_UPCs_df['propISRCs_onstore_{}'.format(country)].fillna(0) #happens if Track Count is empty
    international_UPCs_df['UPC_available_in{}'.format(country)] = international_UPCs_df['UPC_available_in{}'.format(country)].fillna(False)

# ------------------------
international_UPCs_df.head(2)
international_UPCs_df.info()
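
The cell above left-merges availability onto the Rights table, computes a per-country proportion, then fills the gaps with 0. The sketch below reproduces those steps on a toy table for GB only. The `validate` and `indicator` arguments are optional `pd.merge` parameters that the original cell does not use; they are added here as an assumption about what is worth checking (unique UPCs on both sides, and how many UPCs found no Apple match). All values are made up.

```python
import pandas as pd

# Toy stand-ins for the Rights table and the Apple availability table
rights_toy = pd.DataFrame(
    {'Track Count': [10.0, 4.0, float('nan')]},
    index=pd.Index(['001', '002', '003'], name='UPC'))
apple_toy = pd.DataFrame(
    {'num_ISRCs_available_GB': [5.0, 2.0]},
    index=pd.Index(['001', '003'], name='UPC'))

# Left merge on the UPC index; validate raises if a UPC repeats on either side,
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(rights_toy, apple_toy, how='left',
                  left_index=True, right_index=True,
                  validate='one_to_one', indicator=True)
n_unmatched = int((merged['_merge'] == 'left_only').sum())

# Proportion is NaN when either the count or Track Count is missing
merged['propISRCs_onstore_GB'] = (
    merged['num_ISRCs_available_GB'] / merged['Track Count'])

# Fill the gaps with 0, as the cell above does
merged['num_ISRCs_available_GB'] = merged['num_ISRCs_available_GB'].fillna(0)
merged['propISRCs_onstore_GB'] = merged['propISRCs_onstore_GB'].fillna(0)
```

One thing to watch: `fillna(0)` only covers missing values. A Track Count of exactly 0 would give `inf` rather than NaN, so it would need separate handling if such rows exist.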

In [0]:
# # Create & upload file to Google Drive - international_UPCs_df

# international_UPCs_df.to_csv('international_UPCs_base.csv', index=True)
# !zip ZIP_international_UPCs_base.zip international_UPCs_base.csv
# uploaded = drive.CreateFile({'title': 'ZIP_international_UPCs_base.zip'})
# uploaded.SetContentFile('ZIP_international_UPCs_base.zip')
# uploaded.Upload()
# print('Uploaded file with ID {}'.format(uploaded.get('id')))
# !mv ZIP_international_UPCs_base.zip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/'
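
The commented-out cell above writes a CSV and then zips it with a shell command. As a hedged alternative, assuming a pandas version that accepts a dict for the `compression` argument of `to_csv`, the zip can be written directly. The file names and the toy table below are hypothetical; the sketch writes to a temporary directory rather than to Drive.

```python
import os
import tempfile

import pandas as pd

# Toy table with UPCs that have leading zeros (hypothetical values)
toy_df = pd.DataFrame(
    {'size': [1, 2]},
    index=pd.Index(['00123', '00456'], name='UPC'))

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'toy_base.zip')

    # Write the CSV straight into a zip archive
    toy_df.to_csv(zip_path, index=True,
                  compression={'method': 'zip', 'archive_name': 'toy_base.csv'})

    # Read it back; dtype=str keeps the leading zeros in the UPC column
    reloaded = pd.read_csv(zip_path, dtype={'UPC': str}).set_index('UPC')
```

Reading UPC as a string matters here: without `dtype={'UPC': str}` the column would be parsed as integers and the leading zeros dropped.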

# Outputs for GB

In [0]:
# core_columns is defined at the end of the cell that creates rights_wordyinfo_df
# If that cell has not been run, core_columns needs to be defined manually first
core_columns = core_columns + ['num_ISRCs_per_UPC','global_streams_ofISRCs_onUPC']

# Create empty dict to hold the per-country DataFrames
country_data = {}

# # ------------------------
# # Can load the saved UPC base table directly, rather than rebuilding it
# # Not really recommended, but quicker
# international_UPCs_df = pd.read_csv(path+'20181211_international_UPCs_base.csv',
#                                     dtype={'UPC': str},
#                                     low_memory=False).set_index('UPC')
# international_UPCs_df.info()
# international_UPCs_df.head(2)
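
Because the cell above extends `core_columns` by concatenation, running it a second time would list the two extra columns twice, which would in turn duplicate those columns when the list is used to select from the table. A minimal sketch of an order-preserving, re-runnable alternative follows; `extend_unique` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def extend_unique(columns, extra):
    """Return columns plus any names in extra not already present, order kept."""
    result = list(columns)
    for name in extra:
        if name not in result:
            result.append(name)
    return result


# Toy starting list (the real one comes from rights_wordyinfo_df.columns)
core_columns_toy = ['Artist', 'Title']
extra_columns = ['num_ISRCs_per_UPC', 'global_streams_ofISRCs_onUPC']

once = extend_unique(core_columns_toy, extra_columns)
twice = extend_unique(once, extra_columns)  # second run changes nothing
```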

In [0]:
# Get just GB-relevant columns for all UPCs
# And create GB-specific numbers matrix

country_data['gb'] = international_UPCs_df[[col for col in international_UPCs_df.columns if 'GB' in col]+core_columns].copy()
country_data['gb'].info()
country_data['gb'].head(2)

# ------------------------
# Group by all necessary columns to create matrix figures
GB_UPCs_matrix = pd.DataFrame(
    international_UPCs_df.groupby(
        ['GB_legal',
         'GB_marketing',
         'GB_optin',
         'UPC_available_inGB',
         'album',
         'single',
         'duplicate_artist_title',
         'release_before_2018',
         'binary_status']
    ).size()).rename(columns={0:'size'}).reset_index()

# Round each group count to the nearest 100 for nicer figures
GB_UPCs_matrix['size'] = GB_UPCs_matrix['size'].apply(lambda x: round(x,-2))

# ------------------------
# Create 'category' column for legal-marketing-optin-available matrix 
def create_category(row):
    # Each check assumes all earlier checks passed, so conditions are not repeated
    if not row['GB_legal']:
        return 'no_legal'
    if not row['GB_marketing']:
        return 'legal_nomarketing'
    if not row['GB_optin']:
        return 'legal_marketing_no_optin'
    if not row['UPC_available_inGB']:
        if row['release_before_2018']:
            return 'legal_marketing_optin_notavailable_before2018'
        return 'legal_marketing_optin_notavailable_2018onwards'
    return 'all_in'
GB_UPCs_matrix['category'] = GB_UPCs_matrix.apply(create_category,axis=1)

# ------------------------
# Pivot to nice table format
GB_UPCs_matrix = pd.pivot_table(GB_UPCs_matrix,
                                values='size',
                                index=['album','duplicate_artist_title','binary_status'],
                                columns=['category'],
                                aggfunc='sum')
# Sum each row
GB_UPCs_matrix['sum'] = GB_UPCs_matrix.sum(axis=1)

# Reorder columns, from all_in to no_legal
GB_UPCs_matrix = GB_UPCs_matrix[['all_in',
                                 'legal_marketing_optin_notavailable_2018onwards',
                                 'legal_marketing_optin_notavailable_before2018',
                                 'legal_marketing_no_optin',
                                 'legal_nomarketing',
                                 'no_legal',
                                 'sum']]

# Show matrix, and show sum for each column in matrix
GB_UPCs_matrix
GB_UPCs_matrix.sum()

# # ------------------------
# # Create & upload file to Google Drive - GB_UPC_data.csv
# country_data['gb'].to_csv('GB_UPC_data.csv')
# uploaded = drive.CreateFile({'title': 'GB_UPC_data.csv'})
# uploaded.SetContentFile('GB_UPC_data.csv')
# uploaded.Upload()
# print('Uploaded file with ID {}'.format(uploaded.get('id')))

# # ------------------------
# # Create & upload file to Google Drive - GB_aggregatenumbers.csv
# GB_UPCs_matrix.to_csv('GB_aggregatenumbers.csv')
# uploaded = drive.CreateFile({'title': 'GB_aggregatenumbers.csv'})
# uploaded.SetContentFile('GB_aggregatenumbers.csv')
# uploaded.Upload()
# print('Uploaded file with ID {}'.format(uploaded.get('id')))
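
The column reorder in the cell above selects all six categories by name, which raises a `KeyError` if any category is absent from the pivot (for example in a country with no `no_legal` UPCs). A minimal sketch, assuming missing categories should simply appear as empty columns, is to use `reindex` instead. `toy_counts` and its figures are made up for illustration.

```python
import pandas as pd

CATEGORY_ORDER = ['all_in',
                  'legal_marketing_optin_notavailable_2018onwards',
                  'legal_marketing_optin_notavailable_before2018',
                  'legal_marketing_no_optin',
                  'legal_nomarketing',
                  'no_legal']

# Toy group counts in which only two of the six categories occur
toy_counts = pd.DataFrame({
    'binary_status': ['existing', 'existing', 'voided'],
    'category': ['all_in', 'no_legal', 'all_in'],
    'size': [300, 100, 200],
})

toy_matrix = pd.pivot_table(toy_counts,
                            values='size',
                            index='binary_status',
                            columns='category',
                            aggfunc='sum')

# reindex adds absent categories as NaN columns instead of raising KeyError
toy_matrix = toy_matrix.reindex(columns=CATEGORY_ORDER)

# Row totals; NaN cells are skipped by sum()
toy_matrix['sum'] = toy_matrix.sum(axis=1)
```

The same pattern would carry over to other countries by swapping the GB column names, since the category labels themselves are country-independent.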

# Testing