<a href="https://colab.research.google.com/github/richard-cartwright/personal/blob/master/EtoE_Looping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Packages

In [0]:
# Install package for reading Excels
!pip install xlrd

In [0]:
# Basic imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import xlrd

# Plots inline
%matplotlib inline

# Setting plotting styles
plt.style.use('fivethirtyeight')
sns.set_style('white')

# Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Summary
This notebook is based on the main E2E file and is used for the 'looping' process, in which selected UPCs are made live and the number of live ISRCs is adjusted accordingly.
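
As a rough illustration of the looping idea, here is a minimal sketch on toy data. The frame, the `live` column and the helper `count_live_isrcs` are hypothetical stand-ins, not the real files or the notebook's own names.

```python
import pandas as pd

# Toy UPC-ISRC pairs; 'live' says whether the pair is currently on the store
pairs = pd.DataFrame({
    'UPC':  ['01', '01', '02', '02', '03'],
    'ISRC': ['A',  'B',  'B',  'C',  'D'],
    'live': [True, False, False, False, False],
})

def count_live_isrcs(df):
    # An ISRC counts as live if it is live under at least one UPC
    return int(df.groupby('ISRC')['live'].any().sum())

before = count_live_isrcs(pairs)

# "Loop": make every track of UPC '02' live, then recount
upcs_to_make_live = {'02'}
looped = pairs.assign(live=pairs['live'] | pairs['UPC'].isin(upcs_to_make_live))
after = count_live_isrcs(looped)

print(before, after)
```

Making one UPC live can raise the live ISRC count by more than one, because every track on that UPC becomes available.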

# Environment

**ALL THESE FILES ARE REQUIRED**

The default Python environment is Google Colab (which runs on a Google server). I store all my files in Google Drive, so I mount my GDrive to the environment and read and write files directly from it.

**This environment can be changed**. If you are instead working on your local machine, just ensure you have all the required files and are pointing to the correct path.
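
One possible way to switch between the two environments is sketched below. It assumes the common idiom that `google.colab` is already loaded in `sys.modules` inside a Colab runtime; the helper name `resolve_data_path` and the local folder `./Data/` are hypothetical.

```python
import sys

def resolve_data_path(colab_path, local_path):
    # Assumption: the google.colab module is only loaded inside a Colab runtime
    in_colab = 'google.colab' in sys.modules
    return colab_path if in_colab else local_path

path = resolve_data_path(
    '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/',
    './Data/',  # hypothetical local folder holding the same files
)
print(path)
```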

Original data:
- streams per ISRC: ***ISRC_1m.txt***
- Apple Music International, whether each ISRC-UPC pair is available on Apple Music for each country: ***UMI_AppleMusicTracks.txt***
- Apple Music Global, whether each ISRC-UPC pair is available on Apple Music for each country: ***umgglobal_AppleMusicTracks.txt***
- [don't need] DiGS Global Rights data: ***2015to18_digs_rights_allstatus.xlsx, 2010to14_digs_rights_allstatus.xlsx, before2010_digs_rights_allstatus.xlsx***
- all UPC_ISRC pairings (4.5mil): ***upcs_isrcs_digs_rights_allstatus.tsv***

Derived data:
- Boolean, long tables of the Rights data (UPCs): ***rights_2015to18_allcountries.csv, rights_2010to14_allcountries.csv, rights_before2010_allcountries.csv***
- Wordy columns, with derived variables, for full rights data: ***rights_wordyinfo.csv***
- csv of UPCs to make live: ***UPCs_to_make_live.csv***
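
Before running anything, it can help to check that the files this notebook actually reads are present. The sketch below is one way to do that; `REQUIRED_FILES` and `missing_files` are hypothetical names, and `UPCs_to_make_live.csv` is left out because it is only read when the relevant line is uncommented. The demo runs against a throwaway folder rather than the real data folder.

```python
import os
import tempfile

# Files read by this notebook (names taken from the lists above)
REQUIRED_FILES = [
    'ISRC_1m.txt',
    'UMI_AppleMusicTracks.txt',
    'umgglobal_AppleMusicTracks.txt',
    'upcs_isrcs_digs_rights_allstatus.tsv',
    'rights_2015to18_allcountries.csv',
    'rights_2010to14_allcountries.csv',
    'rights_before2010_allcountries.csv',
    'rights_wordyinfo.csv',
]

def missing_files(folder, names=REQUIRED_FILES):
    # Return the required file names that are not present in the folder
    return [n for n in names if not os.path.isfile(os.path.join(folder, n))]

# Demo on a temporary folder that contains only one of the files
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'ISRC_1m.txt'), 'w').close()
    demo_missing = missing_files(demo_dir)

print(demo_missing)
```

In the notebook itself, the call would be `missing_files(path)` once `path` has been defined.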

In [0]:
# Mount GDrive to Colab environment

from google.colab import drive
drive.mount('/content/drive')

In [0]:
# # This is so I can save files directly to my GDrive

# # Install the PyDrive wrapper & import libraries
# !pip install -U -q PyDrive
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials

# # Authenticate and create the PyDrive client.
# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [0]:
# Unzip zipped folders, does not need to be repeated

# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/umgglobal_Music.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'
# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/International.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'
# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/upcs_isrcs_digs_rights_allstatus2.tsv.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# !unzip '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/formerly EMI and now GLOBAL.zip' -d '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

In [0]:
# View files in folder
!ls '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# Create path for data
path = '/content/drive/My Drive/Personal/Colab Notebooks/Universal/Data/'

# Define your country

In [0]:
# Display the list of two-letter country codes to select from

sorted([col for col in pd.read_table(path+'UMI_AppleMusicTracks.txt',nrows=5).columns if len(col)==2])

In [0]:
# Select the country code you want
# Choose only one

country_code = 'AU'

country_col_dict = {'legal':'{}_legal'.format(country_code),
                    'marketing':'{}_marketing'.format(country_code),
                    'optin':'{}_optin'.format(country_code),
                    'available':'UPC_available_in{}'.format(country_code),
                    'category':'category_{}'.format(country_code),
                    'prop_onstore':'propISRCs_onstore_{}'.format(country_code)}
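
If the country needs to change often, the dictionary could be wrapped in a small function. This is only a sketch; `build_country_cols` is a hypothetical helper that reuses the same keys and naming patterns as `country_col_dict`.

```python
def build_country_cols(code):
    # Same keys and naming patterns as country_col_dict
    return {
        'legal': '{}_legal'.format(code),
        'marketing': '{}_marketing'.format(code),
        'optin': '{}_optin'.format(code),
        'available': 'UPC_available_in{}'.format(code),
        'category': 'category_{}'.format(code),
        'prop_onstore': 'propISRCs_onstore_{}'.format(code),
    }

gb_cols = build_country_cols('GB')
print(gb_cols['available'])
```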

# Define the UPCs to make live

In [0]:
# Define the df of UPCs to make live

# Placeholder df so the code still runs when there is no real list of UPCs
UPCs_to_make_live_df = pd.DataFrame({'UPC':['01','02','03']})

# # BE CAREFUL WITH THIS
# # Currently this is the 9945 UPCs for GB which are: albums, not duplicates, released before 2018, have legal, have marketing, not opted in, prop_onstore<0.8
# UPCs_to_make_live_df = pd.read_csv(path+'UPCs_to_make_live.csv',dtype='str')

# This is used in future joins
UPCs_to_make_live_df['live_flag'] = True

UPCs_to_make_live_df.head(2)
UPCs_to_make_live_df.info()
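
The `dtype='str'` in the commented-out `read_csv` matters because UPCs can start with zeros. A small sketch with made-up UPC values shows the difference:

```python
import io
import pandas as pd

# Made-up UPCs with leading zeros
csv_text = "UPC\n00602547\n00731453\n"

as_str = pd.read_csv(io.StringIO(csv_text), dtype='str')
as_int = pd.read_csv(io.StringIO(csv_text))  # pandas infers an integer column

print(as_str['UPC'].tolist())
print(as_int['UPC'].tolist())
```

Read as integers, the UPCs lose their leading zeros and no longer match the string UPCs used in the later joins.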

# Data

## Apple tracks data

### Load Apple availability data

In [0]:
# All tracks on Apple Music

# ---------------------------
# INTERNATIONAL: Read in Apple International ISRCs info (only the relevant country)
international_apple_df = pd.read_table(path+'UMI_AppleMusicTracks.txt',
                                       usecols=['Track_adam_id',
                                                'Track_isrc',
                                                'Playlist_UPC ',
                                                country_code], #only the relevant country
                                       dtype={'Track_adam_id':str,
                                              'Playlist_UPC ':str, # raw header has a trailing space (stripped below)
                                              country_code:'bool'},
                                       low_memory=False)
international_apple_df.columns = [col.strip() for col in international_apple_df.columns]
international_apple_df.info()

# ---------------------------
# GLOBAL: Read in Apple Global ISRCs info (only the relevant country)
global_apple_df = pd.read_table(path+'umgglobal_AppleMusicTracks.txt',
                                usecols=['Track_adam_id',
                                         'Track_isrc',
                                         'Playlist_UPC ',
                                         country_code], #only the relevant country
                                dtype={'Track_adam_id':str,
                                       'Playlist_UPC ':str, # raw header has a trailing space (stripped below)
                                       country_code:'bool'},
                                low_memory=False)
global_apple_df.columns = [col.strip() for col in global_apple_df.columns]
global_apple_df.info()

# ---------------------------
# Combine International & Global
international_apple_df = pd.concat(
    [international_apple_df,
     global_apple_df],
    ignore_index=True)

# Rename columns
international_apple_df.rename(columns={'Playlist_UPC':'UPC',
                                       'Track_isrc':'ISRC'},
                              inplace=True)

# Fillna UPCs & ISRCs with 'unknown'
international_apple_df['UPC'] = international_apple_df['UPC'].fillna('unknown')
international_apple_df['ISRC'] = international_apple_df['ISRC'].fillna('unknown')

international_apple_df.info()

#---------------------
# ISRC-UPC pairs base data from Universal
upcs_isrcs_ref_df = pd.read_table(path+'upcs_isrcs_digs_rights_allstatus.tsv',
                                  dtype={0:'str'},
                                  names=['UPC','ISRC'])
upcs_isrcs_ref_df.info()

# Combine base data & Apple data
international_apple_df = pd.merge(international_apple_df,
                                  upcs_isrcs_ref_df,
                                  how='outer',
                                  on=['UPC','ISRC']).fillna(False)
international_apple_df.info()

# ---------------------------
# Ensure only one row for each UPC-ISRC pair

# Groupby UPC-ISRC
international_apple_df = international_apple_df.groupby(['UPC','ISRC']).sum().astype('bool').reset_index()
international_apple_df.info()
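
The combine-then-deduplicate step can be seen on a toy example. This is a sketch with made-up rows, and it uses `any()` per group instead of the `sum().astype('bool')` used in the cell; the two give the same answer for boolean flags.

```python
import pandas as pd

# Made-up Apple rows (one duplicated pair) and a made-up reference list of pairs
apple = pd.DataFrame({'UPC':  ['01', '01', '02'],
                      'ISRC': ['A',  'A',  'B'],
                      'AU':   [False, True, False]})
reference = pd.DataFrame({'UPC': ['01', '03'], 'ISRC': ['A', 'C']})

# Outer merge keeps pairs that appear in either table
combined = pd.merge(apple, reference, how='outer', on=['UPC', 'ISRC'])

# Pairs only in the reference have no flag (NaN); NaN compares as False
combined['AU'] = combined['AU'].eq(True)

# One row per UPC-ISRC pair: available if any duplicate row says so
deduped = combined.groupby(['UPC', 'ISRC'])['AU'].any().reset_index()
print(deduped)
```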

In [0]:
# Apply UPCs_to_make_live, then add ISRC-level availability and stream counts

# Incorporate UPCs_to_make_live
adjusted_apple_df = international_apple_df.copy()
print(adjusted_apple_df[country_code].sum(),'\n')

# Left join with existing df to incorporate UPCs_to_make_live (flag used as indicator)
adjusted_apple_df = pd.merge(adjusted_apple_df,
                             UPCs_to_make_live_df,
                             how='left',
                             on='UPC').fillna({'live_flag':False})
adjusted_apple_df.info()

# If live_flag is False, keep the existing value. If True, set the column to True
adjusted_apple_df[country_code] = (adjusted_apple_df[country_code].astype('bool')
                                   | adjusted_apple_df['live_flag'].astype('bool'))
adjusted_apple_df.drop(columns=['live_flag'],inplace=True)

#This sum tests that the UPCs have been incorporated as this should increase
print('\n',adjusted_apple_df[country_code].sum(),'\n')
adjusted_apple_df.info()

# ---------------------------
# Add in bool ISRC_availability for each ISRC

# Groupby ISRC to get a bool ISRC_availability for each ISRC (True if the ISRC is available with any UPC)
international_apple_ISRCs_df = adjusted_apple_df.groupby('ISRC').agg({country_code:'sum'}).astype('bool')
international_apple_ISRCs_df.rename(columns={country_code:'ISRC_available_in{}'.format(country_code)},
                                    inplace=True)
international_apple_ISRCs_df.info()

# Merge original Apple ISRC table with whether that ISRC is available for any UPC
adjusted_apple_df = pd.merge(adjusted_apple_df,
                             international_apple_ISRCs_df,
                             how='left',
                             left_on='ISRC',
                             right_index=True)
adjusted_apple_df.info()

#---------------------
# Add in ISRC streams as proxy for popularity

# Number of streams per ISRC across all platforms - proxy for popularity
isrc_streams_df = pd.read_csv(path+'ISRC_1m.txt')
isrc_streams_df.rename(columns={'isrc':'ISRC'},
                       inplace=True)
isrc_streams_df.info()

# Merge ISRC streams onto UPC-ISRCs table; total_streams stays NaN if no matching ISRC (the later groupby sum skips NaN)
adjusted_apple_df = pd.merge(adjusted_apple_df,
                             isrc_streams_df,
                             how='left',
                             on='ISRC')
adjusted_apple_df.info()
adjusted_apple_df.head(2)
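
The two core steps of this cell, overriding availability for the UPCs to make live and then deriving ISRC-level availability, can be sketched on toy data. The rows are made up, and `transform('any')` is used here as a shortcut for the groupby-then-merge in the cell.

```python
import pandas as pd

# Made-up UPC-ISRC pairs; ISRC 'B' appears on two UPCs
pairs = pd.DataFrame({'UPC':  ['01', '01', '02', '03'],
                      'ISRC': ['A',  'B',  'B',  'C'],
                      'AU':   [True, False, False, False]})
to_make_live = pd.DataFrame({'UPC': ['02'], 'live_flag': [True]})

# Left join: rows whose UPC is not in the list get NaN, which compares as False
merged = pd.merge(pairs, to_make_live, how='left', on='UPC')
merged['live_flag'] = merged['live_flag'].eq(True)

# A pair is available if it already was, or if its UPC is being made live
merged['AU'] = merged['AU'] | merged['live_flag']
merged = merged.drop(columns=['live_flag'])

# An ISRC is available if it is available under any of its UPCs
merged['ISRC_available_inAU'] = merged.groupby('ISRC')['AU'].transform('any')
print(merged)
```

Note that ISRC 'B' becomes available on UPC '01' at the ISRC level even though only UPC '02' was made live.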

### Create UPC-level table from ISRC-level Apple availability table

In [0]:
# Groupby UPC - for each UPC for the selected country_code:
# i) have a boolean indicator of whether UPC is available; 
# ii) number of ISRCs available for that UPC
# iii) global_ISRCstreams for UPC

international_UPCs_available_df = adjusted_apple_df.groupby('UPC').sum()
international_UPCs_available_df.info()

international_UPCs_available_df[country_code] = international_UPCs_available_df[country_code].astype('bool')
international_UPCs_available_df.rename(columns={country_code:'UPC_available_in{}'.format(country_code),
                                                'ISRC_available_in{}'.format(country_code):'num_ISRCs_available_{}'.format(country_code),
                                                'total_streams':'global_ISRCstreams_perUPC'},
                                       inplace=True)
international_UPCs_available_df.info()
international_UPCs_available_df.head(2)
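
An alternative to `sum()` followed by renaming is pandas named aggregation, which states the three outputs explicitly. This is a sketch on made-up rows, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Made-up track-level rows for two UPCs
tracks = pd.DataFrame({'UPC': ['01', '01', '02'],
                       'AU': [True, False, False],
                       'ISRC_available_inAU': [True, True, False],
                       'total_streams': [100.0, None, 5.0]})

upc_level = tracks.groupby('UPC').agg(
    UPC_available_inAU=('AU', 'any'),                        # i) is the UPC available at all
    num_ISRCs_available_AU=('ISRC_available_inAU', 'sum'),   # ii) count of available ISRCs
    global_ISRCstreams_perUPC=('total_streams', 'sum'),      # iii) streams, NaN skipped
)
print(upc_level)
```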

## Rights data

### Reupload wordy columns from rights data

In [0]:
# Reupload wordyinfo (e.g. Artist, Title etc) about each UPC

rights_wordyinfo_df = pd.read_csv(path+'rights_wordyinfo.csv',
                                  dtype={'UPC': str}).set_index('UPC')

rights_wordyinfo_df.info()

### Reupload wide Rights data of boolean columns

In [0]:
# Reupload & concat all Rights information

# Only columns for defined country_code
defined_country_cols = []
for daterange in ['2015to18','2010to14','before2010']: # this for loop ensures all columns from the three different csvs are included
    defined_country_cols += [col for col in pd.read_csv(path+'rights_{}_allcountries.csv'.format(daterange),nrows=5).columns if country_code in col]
defined_country_cols = ['UPC'] + list(set(defined_country_cols))

# Some year ranges have columns which are not present in every year range, e.g. if a country code has changed
# A callable usecols skips columns that are missing from a given file instead of raising an error
reuploaded_rights_all_countries_df = pd.concat(
    [pd.read_csv(path+'rights_{}_allcountries.csv'.format(daterange),
                 dtype={'UPC': str},
                 usecols=lambda col: col in defined_country_cols)
     for daterange in ['2015to18','2010to14','before2010']],
    ignore_index=True)
reuploaded_rights_all_countries_df.info()

# ---------------------------
# Ensure only one row for each UPC
reuploaded_rights_all_countries_df = reuploaded_rights_all_countries_df.groupby('UPC').sum().astype('bool')

reuploaded_rights_all_countries_df.info()
reuploaded_rights_all_countries_df.head(2)
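
When the three csvs do not share exactly the same columns, passing a callable to `usecols` is one way to read whatever subset each file has. The sketch below uses two made-up in-memory files; the column names follow the `{code}_legal` pattern but the data is invented.

```python
import io
import pandas as pd

# Two made-up files; the second lacks the AU_optin column
file_a = "UPC,AU_legal,AU_optin\n01,True,False\n"
file_b = "UPC,AU_legal\n02,False\n"
wanted = ['UPC', 'AU_legal', 'AU_optin']

# A callable usecols keeps the wanted columns that exist and ignores missing ones
frames = [pd.read_csv(io.StringIO(text),
                      dtype={'UPC': str},
                      usecols=lambda col: col in wanted)
          for text in (file_a, file_b)]
rights = pd.concat(frames, ignore_index=True)

# Columns missing from a file come through as NaN; NaN compares as False
flag_cols = [c for c in rights.columns if c != 'UPC']
rights[flag_cols] = rights[flag_cols].eq(True)

# One row per UPC
rights = rights.groupby('UPC').any()
print(rights)
```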

In [0]:
path

In [0]:
tester = pd.read_csv(path+'rights_2015to18_allcountries.csv',
                 dtype={'UPC': str})#,
#                  usecols=defined_country_cols)
tester.info()

In [0]:
tester['UPC'].sort_values().dropna()

# Merged UPC table for selected country

In [0]:
# Using the reuploaded Rights data as the base, merge on Apple availability information and Rights wordy info
# This creates the reference table for the selected country_code

# Merge Rights data & derived Apple availability data, for each UPC for selected country_code
selected_country_UPCs_df = pd.merge(reuploaded_rights_all_countries_df,
                                    international_UPCs_available_df,
                                    how='left',
                                    left_index=True,
                                    right_index=True).fillna({'UPC_available_in{}'.format(country_code):False,
                                                              'num_ISRCs_available_{}'.format(country_code):0,
                                                              'global_ISRCstreams_perUPC':0})
selected_country_UPCs_df.info()

# ------------------------
# Merge wordy info
selected_country_UPCs_df = pd.merge(selected_country_UPCs_df,
                                    rights_wordyinfo_df,
                                    how='left',
                                    left_index=True,
                                    right_index=True)
print('\n',selected_country_UPCs_df.shape,'\n')

# ------------------------
# Create propISRCs_onstore for selected country_code
selected_country_UPCs_df['propISRCs_onstore_{}'.format(country_code)] = selected_country_UPCs_df['num_ISRCs_available_{}'.format(country_code)] \
                                                                        / selected_country_UPCs_df['Track Count']
# The proportion is NaN if Track Count is empty, so fill with 0
selected_country_UPCs_df['propISRCs_onstore_{}'.format(country_code)] = selected_country_UPCs_df['propISRCs_onstore_{}'.format(country_code)].fillna(0)

# ------------------------
# Fill in UPCs which are to go live but are not present in the ISRC data: mark them as live with full prop_onstore
# The set intersection keeps only UPCs which are in the data, which makes this robust against errors
# (converted to a list because recent pandas versions reject a set as a .loc indexer)
live_upcs_in_data = list(set(UPCs_to_make_live_df['UPC']) & set(selected_country_UPCs_df.index))
selected_country_UPCs_df.loc[live_upcs_in_data,
                             'UPC_available_in{}'.format(country_code)] = True
selected_country_UPCs_df.loc[live_upcs_in_data,
                             'propISRCs_onstore_{}'.format(country_code)] = 1

selected_country_UPCs_df.info()
selected_country_UPCs_df.head(2)
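
The proportion and the go-live override can be checked by hand on a toy table. This is a sketch with made-up numbers; UPC '99' stands for a UPC in the go-live list that is not in the data.

```python
import pandas as pd

# Made-up UPC-level table; UPC '02' has no Track Count
upcs = pd.DataFrame({'num_ISRCs_available_AU': [8, 0, 3],
                     'Track Count': [10, None, 3],
                     'UPC_available_inAU': [True, False, True]},
                    index=pd.Index(['01', '02', '03'], name='UPC'))

# Proportion of tracks on the store; a missing Track Count gives NaN, filled with 0
upcs['propISRCs_onstore_AU'] = (upcs['num_ISRCs_available_AU']
                                / upcs['Track Count']).fillna(0)

# Only UPCs that exist in the table are overridden
to_make_live = ['02', '99']
present = upcs.index.intersection(to_make_live)
upcs.loc[present, 'UPC_available_inAU'] = True
upcs.loc[present, 'propISRCs_onstore_AU'] = 1

print(upcs)
```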

# Numbers Matrix

In [0]:
# Create helpful number matrix for selected country_code

# Group by all necessary columns to create matrix figures
selected_country_matrix = pd.DataFrame(
    selected_country_UPCs_df.groupby(
        [country_col_dict['legal'],
         country_col_dict['marketing'],
         country_col_dict['optin'],
         country_col_dict['available'],
         'album',
         'single',
         'duplicate_artist_title',
         'release_before_2018',
         'binary_status']
    ).size()).rename(columns={0:'size'}).reset_index()

# Round for nicer figures
selected_country_matrix['size'] = selected_country_matrix['size'].apply(lambda x: round(x,-2))

# ------------------------
# Create 'category' column for legal-marketing-optin-available matrix 
# country_col_dict defined in 'Define Your Country' at top
def create_category(row):    
    if not row[country_col_dict['legal']]:
        return 'no_legal'
    elif (row[country_col_dict['legal']] and not row[country_col_dict['marketing']]):
        return 'legal_nomarketing'
    elif (row[country_col_dict['legal']] 
          and row[country_col_dict['marketing']] 
          and not row[country_col_dict['optin']]):
        return 'legal_marketing_no_optin'
    elif (row[country_col_dict['legal']] 
          and row[country_col_dict['marketing']] 
          and row[country_col_dict['optin']] 
          and not row[country_col_dict['available']] 
          and row['release_before_2018']):
        return 'legal_marketing_optin_notavailable_before2018'
    elif (row[country_col_dict['legal']] 
          and row[country_col_dict['marketing']] 
          and row[country_col_dict['optin']] 
          and not row[country_col_dict['available']] 
          and not row['release_before_2018']):
        return 'legal_marketing_optin_notavailable_2018onwards'
    else:
        return 'all_in'
selected_country_matrix[country_col_dict['category']] = selected_country_matrix.apply(create_category,axis=1)

tester_matrix = selected_country_matrix.copy()

# ------------------------
# Pivot to nice table format
selected_country_matrix = pd.pivot_table(selected_country_matrix,
                                values='size',
                                index=['album','duplicate_artist_title','binary_status'],
                                columns=[country_col_dict['category']],
                                aggfunc='sum')
# Sum each row
selected_country_matrix['sum'] = selected_country_matrix.sum(axis=1)

# Reorder columns, from all_in to no_legal
selected_country_matrix = selected_country_matrix[['all_in',
                                 'legal_marketing_optin_notavailable_2018onwards',
                                 'legal_marketing_optin_notavailable_before2018',
                                 'legal_marketing_no_optin',
                                 'legal_nomarketing',
                                 'no_legal',
                                 'sum']]

# Show matrix, and show sum for each column in matrix
selected_country_matrix
selected_country_matrix.sum()
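
Because each `elif` in `create_category` is only reached when the earlier conditions failed, the repeated checks are redundant. A compact restatement of the same decision order, taking plain booleans, might look like this; `categorise` is a hypothetical name and this is a sketch, not a replacement for the function above.

```python
def categorise(legal, marketing, optin, available, before_2018):
    # Each test is only reached if all earlier requirements were met
    if not legal:
        return 'no_legal'
    if not marketing:
        return 'legal_nomarketing'
    if not optin:
        return 'legal_marketing_no_optin'
    if not available:
        if before_2018:
            return 'legal_marketing_optin_notavailable_before2018'
        return 'legal_marketing_optin_notavailable_2018onwards'
    return 'all_in'

print(categorise(True, True, False, False, True))
print(categorise(True, True, True, False, False))
```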

# For creating new list of UPCs to make live

In [0]:
# Use this to create new csv of UPCs to make live
# Be careful you're running this on the correct phase of the data, and the correct country

# # JUST A HOLDER QUERY. Does not run in order as relies on first running selected_country_UPCs_df
# query_string = """
# album
# & not duplicate_artist_title 
# & release_before_2018
# & {legal}
# & {marketing}
# & not {optin}
# & {prop_onstore} < 0.8
# """.format(**country_col_dict).replace('\n',' ')
# UPCs_to_make_live_df = selected_country_UPCs_df.query(query_string).index.to_frame(index=False)

# UPCs_to_make_live_df.to_csv(path+'UPCs_to_make_live.csv',index=False)
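
To see what the holder query selects, here is a sketch that runs the same kind of query string against a made-up table. The four rows and the `cols` dictionary are invented for illustration; only UPC '01' meets every condition.

```python
import pandas as pd

# Hypothetical column names following the country_col_dict patterns
cols = {'legal': 'AU_legal', 'marketing': 'AU_marketing',
        'optin': 'AU_optin', 'prop_onstore': 'propISRCs_onstore_AU'}

# Made-up UPCs: '02' is mostly on store, '03' is opted in, '04' is a duplicate
df = pd.DataFrame({'album': [True, True, True, True],
                   'duplicate_artist_title': [False, False, False, True],
                   'release_before_2018': [True, True, True, True],
                   'AU_legal': [True, True, True, True],
                   'AU_marketing': [True, True, True, True],
                   'AU_optin': [False, False, True, False],
                   'propISRCs_onstore_AU': [0.5, 0.9, 0.5, 0.5]},
                  index=pd.Index(['01', '02', '03', '04'], name='UPC'))

query_string = ("album & not duplicate_artist_title & release_before_2018 "
                "& {legal} & {marketing} & not {optin} "
                "& {prop_onstore} < 0.8").format(**cols)

selected = df.query(query_string).index.to_frame(index=False)
print(selected)
```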

# Testing

In [0]:
# # Some tests to ensure process did what I wanted it to do

# # initial_tester_matrix = tester_matrix.copy()
# # initial_selected_country_UPCs_df = selected_country_UPCs_df.copy()

# print('\n',initial_selected_country_UPCs_df[country_col_dict['prop_onstore']].mean(),'\n')
# print('\n',selected_country_UPCs_df[country_col_dict['prop_onstore']].mean(),'\n')
# initial_tester_matrix.query('album & not duplicate_artist_title & release_before_2018 & {legal} & {marketing} & not {optin}'.format(**country_col_dict))
# tester_matrix.query('album & not duplicate_artist_title & release_before_2018 & {legal} & {marketing} & not {optin}'.format(**country_col_dict))