# Deezer Music Consumption Analysis
## Life cycle of the project


* Understanding the Problem Statement
* Data Collection
* Data Checks to perform
* Exploratory Data Analysis
* Data Pre-Processing
* Model Training
* Choose Best Model

# 1) Problem Statement

* Model a tag system and study the similarities between tags in order to better understand the music consumption of Deezer users, notably the content they play, search for and save.

# 2) Data Collection
* The dataset was released as part of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011): http://ir.ii.uam.es/hetrec2011

## 2.1) Importing data and the required packages

### 2.1.1) Importing pandas, NumPy, Matplotlib and Seaborn

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### 2.1.2) Import the hetrec2011-lastfm-2k data and read each file as a pandas DataFrame

In [4]:
artists_df = pd.read_csv('data/artists.dat', sep='\t', names=['artistID', 'name', 'url', 'pictureURL'])
tags_df = pd.read_csv('data/tags.dat', sep='\t', names=['tagID', 'tagValue'], encoding='latin-1')
user_artists_df = pd.read_csv('data/user_artists.dat', sep='\t', names=['userID', 'artistID', 'weight'])
user_taggedartists_df = pd.read_csv('data/user_taggedartists.dat', sep='\t', names=['userID', 'artistID', 'tagID', 'day', 'month', 'year'])
user_taggedartists_timestamps_df = pd.read_csv('data/user_taggedartists-timestamps.dat', sep='\t', names=['userID', 'artistID', 'tagID', 'timestamp'])
user_friends_df = pd.read_csv('data/user_friends.dat', sep='\t', names=['userID', 'friendID'])

### 2.1.3) Display the information and the first five rows of each DataFrame

In [20]:
artists_df.info()
print(artists_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17633 entries, 0 to 17632
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   artistID    17633 non-null  object
 1   name        17633 non-null  object
 2   url         17633 non-null  object
 3   pictureURL  17189 non-null  object
dtypes: object(4)
memory usage: 551.2+ KB
  artistID               name                                         url  \
0       id               name                                         url   
1        1       MALICE MIZER       http://www.last.fm/music/MALICE+MIZER   
2        2    Diary of Dreams    http://www.last.fm/music/Diary+of+Dreams   
3        3  Carpathian Forest  http://www.last.fm/music/Carpathian+Forest   
4        4       Moi dix Mois       http://www.last.fm/music/Moi+dix+Mois   

                                          pictureURL  
0                                         pictureURL  
1    http://userserve-ak.last.fm/serv

In [21]:
tags_df.info()
print(tags_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11947 entries, 0 to 11946
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tagID     11947 non-null  object
 1   tagValue  11947 non-null  object
dtypes: object(2)
memory usage: 186.8+ KB
   tagID           tagValue
0  tagID           tagValue
1      1              metal
2      2  alternative metal
3      3          goth rock
4      4        black metal


In [22]:
user_artists_df.info()
print(user_artists_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92835 entries, 0 to 92834
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   userID    92835 non-null  object
 1   artistID  92835 non-null  object
 2   weight    92835 non-null  object
dtypes: object(3)
memory usage: 2.1+ MB
   userID  artistID  weight
0  userID  artistID  weight
1       2        51   13883
2       2        52   11690
3       2        53   11351
4       2        54   10300


In [23]:
user_taggedartists_df.info()
print(user_taggedartists_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186480 entries, 0 to 186479
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   userID    186480 non-null  object
 1   artistID  186480 non-null  object
 2   tagID     186480 non-null  object
 3   day       186480 non-null  object
 4   month     186480 non-null  object
 5   year      186480 non-null  object
dtypes: object(6)
memory usage: 8.5+ MB
   userID  artistID  tagID  day  month  year
0  userID  artistID  tagID  day  month  year
1       2        52     13    1      4  2009
2       2        52     15    1      4  2009
3       2        52     18    1      4  2009
4       2        52     21    1      4  2009


In [24]:
user_taggedartists_timestamps_df.info()
print(user_taggedartists_timestamps_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186480 entries, 0 to 186479
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   userID     186480 non-null  object
 1   artistID   186480 non-null  object
 2   tagID      186480 non-null  object
 3   timestamp  186480 non-null  object
dtypes: object(4)
memory usage: 5.7+ MB
   userID  artistID  tagID      timestamp
0  userID  artistID  tagID      timestamp
1       2        52     13  1238536800000
2       2        52     15  1238536800000
3       2        52     18  1238536800000
4       2        52     21  1238536800000


In [25]:
user_friends_df.info()
print(user_friends_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25435 entries, 0 to 25434
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   userID    25435 non-null  object
 1   friendID  25435 non-null  object
dtypes: object(2)
memory usage: 397.6+ KB
   userID  friendID
0  userID  friendID
1       2       275
2       2       428
3       2       515
4       2       761
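
In every output above, row 0 repeats the column names and every column has dtype `object`. This suggests that each `.dat` file starts with a header line which `read_csv` treated as data, because `names=` was passed without `header=0`. The sketch below reproduces the effect on a small in-memory sample (the string `sample` is a stand-in built from the rows shown above, not the real file) and shows one possible fix.

```python
import io
import pandas as pd

# In-memory stand-in for data/user_artists.dat: tab-separated, with a header line.
sample = "userID\tartistID\tweight\n2\t51\t13883\n2\t52\t11690\n"
columns = ["userID", "artistID", "weight"]

# names= alone: the header line is kept as the first data row,
# so the numeric columns cannot be parsed as integers.
with_header_as_data = pd.read_csv(io.StringIO(sample), sep="\t", names=columns)

# header=0 consumes the first line as the header and replaces it with names=.
clean = pd.read_csv(io.StringIO(sample), sep="\t", header=0, names=columns)

print(with_header_as_data.shape, clean.shape)
print(clean.dtypes)
```

Applied to the real files, the same `header=0` argument should reduce each row count by one and let pandas infer integer dtypes for the ID and weight columns.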


### 2.1.4) Show the shape of each DataFrame

In [12]:
print('1- artists_df dimensions: ', artists_df.shape)
print('2- tags_df: ', tags_df.shape)
print('3- user_artists_df dimensions: ', user_artists_df.shape)
print('4- user_friends_df dimensions: ', user_friends_df.shape)
print('5- user_taggedartists_timestamps_df dimensions: ', user_taggedartists_timestamps_df.shape)
print('6- user_taggedartists_df dimensions: ', user_taggedartists_df.shape)

1- artists_df dimensions:  (17633, 4)
2- tags_df:  (11947, 2)
3- user_artists_df dimensions:  (92835, 3)
4- user_friends_df dimensions:  (25435, 2)
5- user_taggedartists_timestamps_df dimensions:  (186480, 4)
6- user_taggedartists_df dimensions:  (186480, 6)


## 2.2) Dataset information

The package contains 6 files with the following description:
   
   * artists.dat
   
     This file contains information about the music artists listened to and tagged by the users.
   
   * tags.dat
   
   	 This file contains the set of tags available in the dataset.

   * user_artists.dat
   
     This file contains the artists listened to by each user.
        
     It also provides a listening count for each [user, artist] pair.

   * user_taggedartists.dat and user_taggedartists-timestamps.dat
   
     These files contain the tags that each user assigned to artists.
        
     They also record when each tag assignment was made.
   
   * user_friends.dat
   
     This file contains the friend relations between users in the database.
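
The two tag-assignment files appear to describe the same events in two encodings: day/month/year columns in one and an integer timestamp in the other. A minimal sketch follows, assuming the `timestamp` values are milliseconds since the Unix epoch (the dataset description above does not state the unit). The rows are copied from the output shown earlier, and the column name `tagged_at` is introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the user_taggedartists-timestamps output shown earlier.
stamps = pd.DataFrame({
    "userID": [2, 2],
    "artistID": [52, 52],
    "tagID": [13, 15],
    "timestamp": [1238536800000, 1238536800000],
})

# Assumption: the integers are milliseconds since 1970-01-01 UTC.
stamps["tagged_at"] = pd.to_datetime(stamps["timestamp"], unit="ms", utc=True)
print(stamps[["tagID", "tagged_at"]])
```

Under that assumption the value is 22:00 UTC on 31 March 2009, which is midnight on 1 April 2009 at UTC+2. That matches the `day=1, month=4, year=2009` values in user_taggedartists.dat, so the two files look redundant and one of them may be enough for the analysis.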

# 3) Data checks to perform
* Check missing values
* Check duplicates
* Check data types
* Check the number of unique values of each column
* Check statistics of the dataset
* Check the categories present in each categorical column
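
The checks listed above can be sketched as a single helper. `basic_checks` and the `toy` DataFrame are hypothetical names used only for illustration and are not part of the original notebook. The helper relies on standard pandas calls: `isna`, `duplicated`, `dtypes`, `nunique` and `describe`.

```python
import pandas as pd

def basic_checks(df: pd.DataFrame) -> dict:
    """Collect missing values, duplicates, dtypes and unique counts for one DataFrame."""
    return {
        "missing": df.isna().sum().to_dict(),        # missing values per column
        "duplicates": int(df.duplicated().sum()),    # fully duplicated rows
        "dtypes": df.dtypes.astype(str).to_dict(),   # data type per column
        "n_unique": df.nunique().to_dict(),          # distinct non-null values per column
    }

# Toy data with one duplicated row and one missing artistID.
toy = pd.DataFrame({
    "userID": [2, 2, 3, 3],
    "artistID": [51, 51, 52, None],
    "weight": [100, 100, 50, 7],
})

report = basic_checks(toy)
print(report)
print(toy.describe())  # summary statistics for the numeric columns
```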

## 3.1) Check missing values

In [18]:
artists_df.isna().sum()
tags_df.isna().sum()

tagID       0
tagValue    0
dtype: int64
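
A notebook cell displays only its last expression, so the output above covers `tags_df` alone. The earlier `info()` output reports 17189 non-null `pictureURL` values out of 17633 rows, so `artists_df` has 444 missing picture URLs that this cell does not show. One way to report every DataFrame is to loop over them, sketched here on toy stand-ins (the `frames` dictionary and its contents are invented for illustration).

```python
import pandas as pd

# Toy stand-ins for the notebook's DataFrames; the missing pictureURL is invented.
frames = {
    "artists_df": pd.DataFrame({
        "artistID": [1, 2],
        "pictureURL": ["http://example.com/a.jpg", None],
    }),
    "tags_df": pd.DataFrame({"tagID": [1, 2], "tagValue": ["metal", "goth rock"]}),
}

# Print the per-column counts for each frame and keep the totals for later use.
missing_totals = {}
for name, df in frames.items():
    counts = df.isna().sum()
    missing_totals[name] = int(counts.sum())
    print(f"{name}:\n{counts}\n")
```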