# Data Integration

This notebook takes the raw data from the [10+ M. Beatport Tracks / Spotify Audio Features](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features) dataset and combines it into a single .csv file, together with the popularity of each individual track, which we obtained from Spotify.

## Requirements

Because we are currently having trouble with Git Large File Storage (Git LFS), we work with the large files locally. The following raw data files need to be present in the folder:
```
'.../data/raw kaggle data/'
```
in order to generate the final dataset used:

| File Name                | Link                                                                                       |
| -------------------------|------------------------------------------------------------------------------------------- |
| `audio_features.csv`     | [Download](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features/?select=audio_features.csv) |
| `sp_artist.csv`          | [Download](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features/?select=sp_artist.csv) |
| `sp_release.csv`         | [Download](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features/?select=sp_release.csv) |
| `sp_track.csv`           | [Download](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features/?select=sp_track.csv) |
| `sp_artist_release.csv`  | [Download](https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features/?select=sp_artist_release.csv) |

## Process

1. Select popular releases of type single
2. Extract all tracks from these releases
3. Add individual track popularity obtained from Spotify Web API
4. Add track data from remaining CSV files 
5. Clean up the final set

#### Notebook set up

**IMPORTANT!** ONLY RUN ONCE AT THE START!

In [1]:
import os
import pandas as pd

# IMPORTANT! ONLY RUN THIS CELL ONCE: it changes the working directory,
# so re-running it would derive the paths from the wrong starting point.

# Use os.path.join for creating file paths
repo_dir = os.path.dirname(os.getcwd())
data_dir = os.path.join(repo_dir, '../data')
raw_data_dir = os.path.join(data_dir, 'raw kaggle data')
chunks_popularity_dir = os.path.join(data_dir, 'chunks_popularity')

os.chdir(raw_data_dir)

You will need 4GB of free RAM to load all raw data files into data frames.
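
To get a rough idea of how much memory a loaded frame occupies, and how restricting the columns affects it, here is a minimal sketch on a made-up in-memory CSV. The sample data and the chosen `usecols` subset are assumptions for illustration only; the real files are far larger.

```python
import io
import pandas as pd

# Tiny stand-in for one of the raw CSV files (contents are made up)
sample_csv = io.StringIO(
    "track_id,isrc,track_title,preview_url\n"
    "a1,ISRC0001,First track,https://example.com/1\n"
    "b2,ISRC0002,Second track,https://example.com/2\n"
)

# Load every column and measure the frame, including string contents
full_df = pd.read_csv(sample_csv)
full_bytes = full_df.memory_usage(deep=True).sum()

# Re-read only the columns needed downstream
sample_csv.seek(0)
slim_df = pd.read_csv(sample_csv, usecols=["track_id", "isrc"])
slim_bytes = slim_df.memory_usage(deep=True).sum()

print(f"all columns: {full_bytes} bytes, selected columns: {slim_bytes} bytes")
```

Whether column selection is worthwhile here depends on which columns the later cells actually use.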

In [2]:
# Load all relevant .csv files
sp_release_df = pd.read_csv("sp_release.csv")
sp_track_df = pd.read_csv("sp_track.csv")
sp_artist_df = pd.read_csv("sp_artist.csv")
sp_artist_release_df = pd.read_csv("sp_artist_release.csv")
audio_features_df = pd.read_csv("audio_features.csv")

In [3]:
# We obtained the popularity of individual songs in chunks of 5000 tracks. See README.md for more details

# Initialize an empty list to store DataFrames
dfs = []

# List all CSV files in the ".../data/chunks_popularity/" folder
csv_files = [file for file in os.listdir(chunks_popularity_dir) if file.endswith('.csv')]

# Loop through the CSV files and read them into DataFrames
for file in csv_files:
    file_path = os.path.join(chunks_popularity_dir, file)
    df = pd.read_csv(file_path)
    dfs.append(df)

# Concatenate all the DataFrames into a single DataFrame
popularity_tracks_df = pd.concat(dfs, ignore_index=True)

### 1. Select popular releases of type single

In [4]:
# Select only releases of type single that have some popularity
sp_release_singles_df = sp_release_df[sp_release_df['album_type'] == 'single']
popular_singles_df = sp_release_singles_df[sp_release_singles_df['popularity'] != 0]
popular_singles_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119983 entries, 37 to 713562
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   release_id     119983 non-null  object 
 1   release_title  119982 non-null  object 
 2   release_date   119983 non-null  object 
 3   upc            119980 non-null  float64
 4   popularity     119983 non-null  int64  
 5   total_tracks   119983 non-null  int64  
 6   album_type     119983 non-null  object 
 7   release_img    119974 non-null  object 
 8   label_name     119982 non-null  object 
 9   updated_on     119983 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 10.1+ MB


### 2. Extract all tracks from these releases

In [5]:
# Filter tracks on release_id and retain only the relevant features
filter_ids = popular_singles_df['release_id'].values

mask = sp_track_df['release_id'].isin(filter_ids)

relevant_columns = ['track_id', 'isrc', 'explicit', 'track_title', 'preview_url', 'release_id']
popular_tracks_df = sp_track_df.loc[mask, relevant_columns].copy()
popular_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 507644 entries, 1312 to 5777704
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   track_id     507644 non-null  object
 1   isrc         507639 non-null  object
 2   explicit     507644 non-null  object
 3   track_title  507641 non-null  object
 4   preview_url  502144 non-null  object
 5   release_id   507644 non-null  object
dtypes: object(6)
memory usage: 27.1+ MB


In [6]:
# We will merge on isrc later, so drop rows with a missing isrc and keep one row per isrc
popular_tracks_df = popular_tracks_df.dropna(subset=['isrc'])
popular_tracks_df = popular_tracks_df.drop_duplicates(subset=['isrc'])
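
The reason for cleaning the merge key can be illustrated on toy data: a repeated key carries through a merge as repeated rows, and multiplies them if it repeats on both sides. The frames below are invented, and `validate` is an optional `pd.merge` argument (not used in this notebook) that raises if the key relationship is not the expected one.

```python
import pandas as pd

# Invented track rows: "A" is duplicated and one row has no isrc
tracks = pd.DataFrame({
    "isrc": ["A", "A", "B", None],
    "track_id": ["t1", "t2", "t3", "t4"],
})
features = pd.DataFrame({"isrc": ["A", "B"], "tempo": [120, 128]})

# Drop rows without a key, then keep the first row per key
clean = tracks.dropna(subset=["isrc"]).drop_duplicates(subset=["isrc"])

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side
merged = pd.merge(clean, features, on="isrc", how="inner", validate="one_to_one")
print(merged)
```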

In [7]:
# Now we merge the two to get a dataframe with all tracks contained in a popular single release
merged_popular_tracks_df = pd.merge(popular_tracks_df, popular_singles_df, on='release_id', how='left')

# Rename 'popularity' to 'release_popularity' to avoid confusion (same for total_tracks)
merged_popular_tracks_df = merged_popular_tracks_df.rename(columns={
    'popularity':'release_popularity', 
    'total_tracks':'total_tracks_in_release'})
merged_popular_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505757 entries, 0 to 505756
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   track_id                 505757 non-null  object 
 1   isrc                     505757 non-null  object 
 2   explicit                 505757 non-null  object 
 3   track_title              505754 non-null  object 
 4   preview_url              500285 non-null  object 
 5   release_id               505757 non-null  object 
 6   release_title            505756 non-null  object 
 7   release_date             505757 non-null  object 
 8   upc                      505741 non-null  float64
 9   release_popularity       505757 non-null  int64  
 10  total_tracks_in_release  505757 non-null  int64  
 11  album_type               505757 non-null  object 
 12  release_img              505720 non-null  object 
 13  label_name               505753 non-null  object 
 14  upda

### 3. Add individual track popularity obtained from Spotify Web API

In [8]:
popularity_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505694 entries, 0 to 505693
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   track_id    505694 non-null  object 
 1   popularity  378050 non-null  float64
 2   Popularity  127644 non-null  float64
 3   updated_on  39551 non-null   object 
 4   udated_on   115000 non-null  object 
dtypes: float64(2), object(3)
memory usage: 19.3+ MB


Problem: the chunk files do not spell their column names consistently (`popularity` vs. `Popularity`, `updated_on` vs. `udated_on`), so `pd.concat` created a separate column for each spelling and filled the rows that lack it with NaN.
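
A small sketch of how such a mismatch arises, and one way to avoid it by normalising the column names before concatenating. The chunk contents and `rename_map` are made up for illustration; the notebook itself repairs the columns after the fact.

```python
import pandas as pd

# Two made-up chunks whose headers differ only in spelling or capitalisation
chunk_a = pd.DataFrame({"track_id": ["t1"], "popularity": [10], "updated_on": ["2023-08-22"]})
chunk_b = pd.DataFrame({"track_id": ["t2"], "Popularity": [20], "udated_on": ["2023-08-23"]})

# Naive concat: each spelling becomes its own column, NaN where a chunk lacks it
naive = pd.concat([chunk_a, chunk_b], ignore_index=True)
print(naive.columns.tolist())

# Hypothetical mapping from the misspelled headers to the intended ones
rename_map = {"Popularity": "popularity", "udated_on": "updated_on"}
normalised = pd.concat(
    [chunk.rename(columns=rename_map) for chunk in (chunk_a, chunk_b)],
    ignore_index=True,
)
print(normalised)
```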

In [9]:
# Combine the "Popularity" and "popularity" columns into a new column called "Combined_Popularity"
popularity_tracks_df['Combined_Popularity'] = popularity_tracks_df['Popularity'].fillna(popularity_tracks_df['popularity'])

# Drop the original popularity columns and the two timestamp columns
popularity_tracks_df.drop(['Popularity', 'popularity', 'udated_on', 'updated_on'], axis=1, inplace=True)

popularity_tracks_df = popularity_tracks_df.rename(columns={'Combined_Popularity': 'popularity'})
popularity_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505694 entries, 0 to 505693
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   track_id    505694 non-null  object 
 1   popularity  505694 non-null  float64
dtypes: float64(1), object(1)
memory usage: 7.7+ MB


In [10]:
merged_popular_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505757 entries, 0 to 505756
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   track_id                 505757 non-null  object 
 1   isrc                     505757 non-null  object 
 2   explicit                 505757 non-null  object 
 3   track_title              505754 non-null  object 
 4   preview_url              500285 non-null  object 
 5   release_id               505757 non-null  object 
 6   release_title            505756 non-null  object 
 7   release_date             505757 non-null  object 
 8   upc                      505741 non-null  float64
 9   release_popularity       505757 non-null  int64  
 10  total_tracks_in_release  505757 non-null  int64  
 11  album_type               505757 non-null  object 
 12  release_img              505720 non-null  object 
 13  label_name               505753 non-null  object 
 14  upda

In [11]:
# Add the individual track popularity, then drop duplicate tracks and tracks without a popularity value
merged_popular_tracks_df = pd.merge(merged_popular_tracks_df, popularity_tracks_df, on='track_id', how='left')
merged_popular_tracks_df = merged_popular_tracks_df.drop_duplicates(subset=['isrc'])
merged_popular_tracks_df = merged_popular_tracks_df.dropna(subset=['popularity'])
merged_popular_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 498826 entries, 0 to 510748
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   track_id                 498826 non-null  object 
 1   isrc                     498826 non-null  object 
 2   explicit                 498826 non-null  object 
 3   track_title              498823 non-null  object 
 4   preview_url              493391 non-null  object 
 5   release_id               498826 non-null  object 
 6   release_title            498825 non-null  object 
 7   release_date             498826 non-null  object 
 8   upc                      498810 non-null  float64
 9   release_popularity       498826 non-null  int64  
 10  total_tracks_in_release  498826 non-null  int64  
 11  album_type               498826 non-null  object 
 12  release_img              498789 non-null  object 
 13  label_name               498822 non-null  object 
 14  updated_o
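
The index in the output above runs past the 505757 rows of the left frame, which suggests that some `track_id` values occur more than once in the popularity chunks (an inference from the output, not something verified here). A minimal sketch on invented data of how that inflates a left merge, and how de-duplicating the right frame first avoids it:

```python
import pandas as pd

left = pd.DataFrame({"track_id": ["t1", "t2"], "isrc": ["A", "B"]})
# "t1" appears twice, as it might if a track had been fetched in two chunks
right = pd.DataFrame({"track_id": ["t1", "t1", "t2"], "popularity": [30.0, 30.0, 5.0]})

# The repeated key on the right produces one extra row on the left
inflated = pd.merge(left, right, on="track_id", how="left")
print(len(inflated))

# Keeping one row per track_id on the right preserves the left row count
right_unique = right.drop_duplicates(subset=["track_id"])
deduped = pd.merge(left, right_unique, on="track_id", how="left")
print(len(deduped))
```

The notebook reaches the same row count by dropping duplicates on `isrc` after the merge; de-duplicating beforehand simply keeps the intermediate frame smaller.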

### 4. Add track data from remaining CSV files 

In [12]:
# Add audio features to the tracks
popular_tracks_w_audio_features_df = pd.merge(
    merged_popular_tracks_df,
    audio_features_df,
    on="isrc",
    how='inner')
popular_tracks_w_audio_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489524 entries, 0 to 489523
Data columns (total 30 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   track_id                 489524 non-null  object 
 1   isrc                     489524 non-null  object 
 2   explicit                 489524 non-null  object 
 3   track_title              489521 non-null  object 
 4   preview_url              484327 non-null  object 
 5   release_id               489524 non-null  object 
 6   release_title            489523 non-null  object 
 7   release_date             489524 non-null  object 
 8   upc                      489508 non-null  float64
 9   release_popularity       489524 non-null  int64  
 10  total_tracks_in_release  489524 non-null  int64  
 11  album_type               489524 non-null  object 
 12  release_img              489487 non-null  object 
 13  label_name               489520 non-null  object 
 14  upda

In [13]:
sp_artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676911 entries, 0 to 676910
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   artist_id    676911 non-null  object
 1   artist_name  676900 non-null  object
 2   updated_on   676911 non-null  object
dtypes: object(3)
memory usage: 15.5+ MB


In [14]:
sp_artist_release_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3969489 entries, 0 to 3969488
Data columns (total 3 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   release_id  object
 1   artist_id   object
 2   updated_on  object
dtypes: object(3)
memory usage: 90.9+ MB


In [15]:
# Merge the artists with their releases
sp_artist_df = pd.merge(sp_artist_df, sp_artist_release_df, on='artist_id', how='left')
sp_artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3969489 entries, 0 to 3969488
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   artist_id     object
 1   artist_name   object
 2   updated_on_x  object
 3   release_id    object
 4   updated_on_y  object
dtypes: object(5)
memory usage: 151.4+ MB


In [16]:
# Now merge everything to produce the final set
final_set_df = pd.merge(popular_tracks_w_audio_features_df, sp_artist_df, on='release_id', how='left')
final_set_df = final_set_df.drop_duplicates(subset=['isrc'])
final_set_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 489524 entries, 0 to 1407325
Data columns (total 34 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   track_id                 489524 non-null  object 
 1   isrc                     489524 non-null  object 
 2   explicit                 489524 non-null  object 
 3   track_title              489521 non-null  object 
 4   preview_url              484327 non-null  object 
 5   release_id               489524 non-null  object 
 6   release_title            489523 non-null  object 
 7   release_date             489524 non-null  object 
 8   upc                      489508 non-null  float64
 9   release_popularity       489524 non-null  int64  
 10  total_tracks_in_release  489524 non-null  int64  
 11  album_type               489524 non-null  object 
 12  release_img              489487 non-null  object 
 13  label_name               489520 non-null  object 
 14  updated_
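
Because a release can have several artists, the merge on `release_id` repeats each track once per artist (the index above runs to 1407325), and `drop_duplicates` then keeps only the first artist encountered. If all artists were wanted instead, one option (not what the notebook does) would be to collapse the artist names per release before merging. A sketch on made-up data:

```python
import pandas as pd

tracks = pd.DataFrame({"isrc": ["A", "B"], "release_id": ["r1", "r2"]})
# Invented artist rows: release "r1" has two artists
artists = pd.DataFrame({
    "release_id": ["r1", "r1", "r2"],
    "artist_name": ["Artist One", "Artist Two", "Artist Three"],
})

# One row per release, with the artist names joined into a single string
artists_per_release = (
    artists.dropna(subset=["artist_name"])
    .groupby("release_id", as_index=False)["artist_name"]
    .agg(", ".join)
)

# The right frame is now unique per release_id, so no rows are duplicated
with_artists = pd.merge(tracks, artists_per_release, on="release_id", how="left")
print(with_artists)
```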

In [17]:
# Load the new dataset for preview_url (read from the current working directory, i.e. raw_data_dir)
preview_url_df = pd.read_csv("preview_url.csv")
print(preview_url_df.head())

                                         preview_url                track_id
0  https://p.scdn.co/mp3-preview/5089e683a4b81f8f...  1TJfx5wZrjAtvGc2LcRD50
1  https://p.scdn.co/mp3-preview/bb53801d7b939b2b...  1UmipCq9fDAl0457eOrnAq
2  https://p.scdn.co/mp3-preview/7f0559e58f37f040...  3gTN5N6qtxokiy1MSAyRHf
3  https://p.scdn.co/mp3-preview/7c15952f6d283d15...  6f43tdfJ9BlUBLzvKim98A
4  https://p.scdn.co/mp3-preview/1883d8ef8a5628f0...  6QXxvRGql81hrkNKl1sJNy


In [18]:
# Merge the new preview URL data into the final dataset using track_id
# The 'suffixes' argument is used to handle any duplicate column names
final_set_df = pd.merge(final_set_df, preview_url_df, on='track_id', how='left', suffixes=('', '_new'))

# Check if 'preview_url_new' is in the final dataset after merging
if 'preview_url_new' in final_set_df.columns:
    # Replace the old preview_url column with the new one from preview_url_df
    final_set_df['preview_url'] = final_set_df['preview_url_new']
    # Drop the now redundant 'preview_url_new' column
    final_set_df.drop('preview_url_new', axis=1, inplace=True)
else:
    # If 'preview_url_new' is not in the DataFrame, it means the column name in preview_url_df was different
    # or there was no conflicting 'preview_url' column in the original DataFrame
    print("Column 'preview_url_new' not found. Check the column names in preview_url_df.")

# Check the first few rows of the updated final dataset
print(final_set_df.head())

                 track_id          isrc explicit  \
0  1TJfx5wZrjAtvGc2LcRD50  NLE800510140        t   
1  3gTN5N6qtxokiy1MSAyRHf  NLE800510143        t   
2  6QXxvRGql81hrkNKl1sJNy  GBKQU1986237        f   
3  6R5Ut5Mb88EqyLLiNaIOD0  NLE802200356        f   
4  7o7Qx3kulN6A0uOaAPf5Vz  NLE802200334        f   

                       track_title  \
0           XTC Love - Radio Mix 1   
1          XTC Love - Original Mix   
2                          blinded   
3  Time To Dance Again - Arena Mix   
4              Time To Dance Again   

                                         preview_url              release_id  \
0  https://p.scdn.co/mp3-preview/5089e683a4b81f8f...  1X5sxCP21FYyT1G4laJRPa   
1  https://p.scdn.co/mp3-preview/7f0559e58f37f040...  1X5sxCP21FYyT1G4laJRPa   
2  https://p.scdn.co/mp3-preview/1883d8ef8a5628f0...  1CBV9MWSJ4hzlG7FswiFhS   
3  https://p.scdn.co/mp3-preview/6b57227204be23bd...  2Ai0JEUm1XYMsOMWMHgnod   
4  https://p.scdn.co/mp3-preview/4229181352cb0c16...  2Ai0
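
Assigning `preview_url_new` wholesale can also overwrite an existing URL with NaN when a track has no usable entry in `preview_url.csv`. If the intent were to prefer the new URL but fall back to the old one (an assumption; the notebook replaces the column outright), `fillna` could express that. A sketch on invented data:

```python
import pandas as pd

final = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "preview_url": ["old1", "old2", None],
})
# Invented refreshed URLs: nothing is available for "t2"
new_urls = pd.DataFrame({"track_id": ["t1", "t3"], "preview_url": ["new1", "new3"]})

merged = pd.merge(final, new_urls, on="track_id", how="left", suffixes=("", "_new"))

# Prefer the refreshed URL, keep the original where no refreshed one exists
merged["preview_url"] = merged["preview_url_new"].fillna(merged["preview_url"])
merged = merged.drop(columns="preview_url_new")
print(merged)
```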

### 5. Clean up the final set

In [21]:
# Re-index after all the merging
final_set_df = final_set_df.reset_index(drop=True)
final_set_df.head(10)

Unnamed: 0,track_id,isrc,explicit,track_title,preview_url,release_id,release_title,release_date,upc,release_popularity,...,mode,speechiness,tempo,time_signature,valence,updated_on_y_x,artist_id,artist_name,updated_on_x_y,updated_on_y_y
0,1TJfx5wZrjAtvGc2LcRD50,NLE800510140,t,XTC Love - Radio Mix 1,https://p.scdn.co/mp3-preview/5089e683a4b81f8f...,1X5sxCP21FYyT1G4laJRPa,XTC Love (Remixes),1995,8715576000000.0,25,...,1,0.0622,174,4,0.751,2023-08-28 18:28:29,3rHHWDHE0Nh2ulbinBUT7L,Bertocucci Feranzano,2023-08-22 18:08:59,2023-08-22 18:08:59
1,3gTN5N6qtxokiy1MSAyRHf,NLE800510143,t,XTC Love - Original Mix,https://p.scdn.co/mp3-preview/7f0559e58f37f040...,1X5sxCP21FYyT1G4laJRPa,XTC Love (Remixes),1995,8715576000000.0,25,...,0,0.128,166,4,0.281,2023-08-28 18:28:29,3rHHWDHE0Nh2ulbinBUT7L,Bertocucci Feranzano,2023-08-22 18:08:59,2023-08-22 18:08:59
2,6QXxvRGql81hrkNKl1sJNy,GBKQU1986237,f,blinded,https://p.scdn.co/mp3-preview/1883d8ef8a5628f0...,1CBV9MWSJ4hzlG7FswiFhS,blinded,2019-09-13,5054285000000.0,11,...,1,0.0345,82,4,0.119,2023-08-24 09:34:11,2EuvmrUgxr2p9iV0aALuUn,Evate,2023-08-22 18:09:04,2023-08-22 18:09:04
3,6R5Ut5Mb88EqyLLiNaIOD0,NLE802200356,f,Time To Dance Again - Arena Mix,https://p.scdn.co/mp3-preview/6b57227204be23bd...,2Ai0JEUm1XYMsOMWMHgnod,Time To Dance Again,2022-07-28,8715576000000.0,23,...,0,0.341,146,4,0.102,2023-08-24 09:35:14,6C0KWmCdqrLU2LzzWBPbOy,Headhunterz,2023-08-22 18:09:04,2023-08-22 18:09:04
4,7o7Qx3kulN6A0uOaAPf5Vz,NLE802200334,f,Time To Dance Again,https://p.scdn.co/mp3-preview/4229181352cb0c16...,2Ai0JEUm1XYMsOMWMHgnod,Time To Dance Again,2022-07-28,8715576000000.0,23,...,1,0.18,75,4,0.19,2023-08-24 09:35:14,6C0KWmCdqrLU2LzzWBPbOy,Headhunterz,2023-08-22 18:09:04,2023-08-22 18:09:04
5,0zolxiS5uiL5towOJrsJi4,NLE801900579,f,Home,https://p.scdn.co/mp3-preview/4d5c7620737fbd8e...,4MzEIsQjND2tLFpckD4k4i,Home,2019-09-12,8715576000000.0,41,...,0,0.0845,150,4,0.367,2023-08-24 09:35:14,6C0KWmCdqrLU2LzzWBPbOy,Headhunterz,2023-08-22 18:09:04,2023-08-22 18:09:07
6,6AJ1Rk1khe1egig27nXImU,DEKU31800053,f,Pendular,https://p.scdn.co/mp3-preview/d8e54d5d229948b9...,7uyLY8BzV9IBvAPJSXF9O1,Pendular,2018-07-31,5054284000000.0,4,...,0,0.0477,110,4,0.768,2023-08-24 09:33:01,4FXp2weLNGcM2w6YrvXIpK,Crossing Colors,2023-08-22 18:08:52,2023-08-22 18:09:15
7,2KM4KCDbbfOvQ5cgWADd0Z,NLE802100230,f,Way Of Life,https://p.scdn.co/mp3-preview/a8e60e3bd2defda5...,5lEbkoP9agkFBm91HCyr5e,Way Of Life,2021-05-27,8715576000000.0,36,...,0,0.035,150,4,0.123,2023-08-24 09:35:14,5QySqc6yAFDx9m7fedFZmC,Brennan Heart,2023-08-22 18:09:01,2023-08-22 18:09:16
8,7t2gGVF4Q7QnFM9dt9F0qE,NLE802100231,f,Way Of Life - Extended Mix,https://p.scdn.co/mp3-preview/758c6b9f2ff9b87b...,5lEbkoP9agkFBm91HCyr5e,Way Of Life,2021-05-27,8715576000000.0,36,...,0,0.0566,150,4,0.439,2023-08-24 09:35:14,5QySqc6yAFDx9m7fedFZmC,Brennan Heart,2023-08-22 18:09:01,2023-08-22 18:09:16
9,2zyNdmZPg8etdhRpq8luNA,US83Z2201210,f,Mondschein,https://p.scdn.co/mp3-preview/be9b875036d90e36...,7cXung70x5NizMa5nKArbV,Mondschein,2022-02-04,853566100000.0,1,...,1,0.0334,127,4,0.038,2023-08-24 09:36:18,7Lh93l3X1NHoq1M8JnXX6P,Kraft Der Sonne,2023-08-22 18:09:19,2023-08-22 18:09:19


We do not need the following columns, as they are not relevant to the project:

- `isrc`
- `track_title`
- `time_signature`
- `release_id`
- `release_title`
- `upc`
- `release_popularity`
- `total_tracks_in_release`
- `album_type`
- `release_img`
- `label_name`
- `artist_id`
- `updated_on_x_x`
- `updated_on_y_x`
- `updated_on_x_y`
- `updated_on_y_y`

In [22]:
# Dropping the columns
final_set_df.drop(['isrc', 'track_title', 'release_id', 'release_title', 'upc', 'release_popularity', 
                  'total_tracks_in_release', 'album_type', 'release_img', 'label_name', 'artist_id', 'time_signature',
                  'updated_on_x_x', 'updated_on_y_x', 'updated_on_x_y', 'updated_on_y_y'], axis=1, inplace=True)

In [23]:
final_set_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489524 entries, 0 to 489523
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          489524 non-null  object 
 1   explicit          489524 non-null  object 
 2   preview_url       472072 non-null  object 
 3   release_date      489524 non-null  object 
 4   popularity        489524 non-null  float64
 5   acousticness      489524 non-null  float64
 6   danceability      489524 non-null  float64
 7   duration_ms       489524 non-null  int64  
 8   energy            489524 non-null  float64
 9   instrumentalness  489524 non-null  float64
 10  key               489524 non-null  int64  
 11  liveness          489524 non-null  float64
 12  loudness          489524 non-null  float64
 13  mode              489524 non-null  int64  
 14  speechiness       489524 non-null  float64
 15  tempo             489524 non-null  int64  
 16  valence           48

In [24]:
# Export to .csv
os.chdir(data_dir)
final_set_df.to_csv('integrated_data.csv')
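
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Whether later notebooks rely on that column is not known here, so this is only a sketch of the two behaviours on a toy frame:

```python
import io
import pandas as pd

df = pd.DataFrame({"track_id": ["t1", "t2"], "popularity": [10.0, 20.0]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```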