# Capstone: Kpop Group Lifespan Analysis

### Overall Contents:
- [Problem Statement](#1.-Problem-Statement) **(In this notebook)**
- [Data Cleaning](#2.-Data-Cleaning) **(In this notebook)**
- Exploratory Data Analysis
- Modeling
- Evaluation
- Conclusion and Recommendation

## 1. Problem Statement

We work at one of the major music entertainment companies in Korea, which manages multiple music groups with varying levels of performance. Competition in this industry is very tough. Every music group has to keep releasing new music to hold on to its existing audience or to win new listeners; once a group is unable to do so, other groups will take the opportunity to overtake it. The role of a music entertainment company is not only to promote its existing music groups but also to recruit trainees and prepare them for debut when a group shows signs of being on the verge of disbandment. This is a long process, as new trainees often go through many years of training before they are even considered for a new group. A way to estimate the time remaining before a group disbands would therefore help the company save the time and resources spent on trainees. We are tasked with building a model that predicts the expected lifespan of a music group before it disbands, based on its performance history so far.

### 1.1 Datasets

The dataset contains details of the various music groups in Korea. The data were obtained from a combination of websites: [kpop fandom](https://kpop.fandom.com/wiki/Main_Page), [dbkpop](https://dbkpop.com/) and [Google Trends](https://trends.google.com/trends/explore?date=all&q=%2Fm%2F02yh8l).

The datasets obtained are as follows:

* musicgroup_df (data of all music groups)
* disband_df (data of music groups that have been officially disbanded)
* active_df (data of music groups that have not been officially disbanded)

## 2. Data Cleaning

### 2.1 Libraries Import

In [1]:
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
%matplotlib inline 
# Raise the display limits for column width and number of rows
pd.options.display.max_colwidth = 400
pd.options.display.max_rows = 400

### 2.2 Data Import

In [2]:
# Import data from csv
musicgroup_df = pd.read_csv('../data/MusicGroups.csv')
kpoptrend_df = pd.read_csv("../data/KpopTrend2014To2021.csv")

### 2.3 Data Cleaning

#### 2.3.1 Overview & Info

In [3]:
# Header of musicgroup_df
musicgroup_df.head()

Unnamed: 0,Group Name,Company,Group Type,Debut Year,Disband Year,Current Status,Social Accounts,Inactive Members,Current Members,Original Members Remainding,...,Digital Singles,Other Singles,Foreign Albums,Foreign Mini Albums,Foreign Other Album,Foreign Singles,Foreign Digital Singles,Foreign Other Singles,Others,Last Production Year
0,(G)I-DLE,Cube Entertainment,Female,2018,,Active,10,1,5,5,...,4,0,0,2,0,2,2,0,5,2021
1,10X10,Gaon Entertainment,Female,2015,2016.0,Disband,2,0,6,6,...,3,0,0,0,0,0,0,0,0,2016
2,1NB,Trivus Entertainment,Female,2017,2018.0,Disband,4,0,5,5,...,6,0,0,0,0,0,0,0,0,2018
3,1PS,Maroo Entertainment,Female,2014,2015.0,Disband,3,0,5,4,...,1,0,0,0,0,0,0,0,1,2014
4,2EYES,SidusHQ,Female,2013,2017.0,Disband,2,0,4,4,...,1,0,0,0,0,0,0,0,1,2015


In [4]:
# info of musicgroup_df
musicgroup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 27 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Group Name                   705 non-null    object 
 1   Company                      705 non-null    object 
 2   Group Type                   705 non-null    object 
 3   Debut Year                   705 non-null    int64  
 4   Disband Year                 374 non-null    float64
 5   Current Status               705 non-null    object 
 6   Social Accounts              705 non-null    int64  
 7   Inactive Members             705 non-null    int64  
 8   Current Members              705 non-null    int64  
 9   Original Members Remainding  705 non-null    int64  
 10  Initial Members              705 non-null    int64  
 11  Member Changes               705 non-null    int64  
 12  SubUnits                     705 non-null    int64  
 13  Albums              

**Analysis: The 'Disband Year' column has missing values (only 374 of 705 entries are non-null). Column names contain spaces and capital letters. The value ranges of all 'year' columns and the values in 'Company' still need to be checked.**
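
One way to back up the missing-value observation is to confirm that every row without a 'Disband Year' belongs to a group still marked as active. The sketch below is only illustrative: `sample_df` is a small hand-made frame whose rows mirror the `head()` output above, not the real file, and the column names are the ones listed in the `info()` output.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the rows mirror the head() output above.
sample_df = pd.DataFrame({
    "Group Name": ["(G)I-DLE", "10X10", "1NB"],
    "Disband Year": [np.nan, 2016.0, 2018.0],
    "Current Status": ["Active", "Disband", "Disband"],
})

# Count missing values per column.
missing_counts = sample_df.isna().sum()

# Cross-tabulate "is Disband Year missing?" against the status label.
missing_by_status = pd.crosstab(
    sample_df["Disband Year"].isna(), sample_df["Current Status"]
)

# True when every row with a missing disband year is labelled "Active".
only_active_missing = (
    sample_df.loc[sample_df["Disband Year"].isna(), "Current Status"] == "Active"
).all()

print(missing_counts)
print(missing_by_status)
print(only_active_missing)
```

If `only_active_missing` were False on the real data, some disbanded groups would be missing their disband year and would need separate handling.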

In [5]:
kpoptrend_df.head()

Unnamed: 0,Category: Arts & Entertainment
Month,K-pop: (Worldwide)
2004-01,6
2004-02,5
2004-03,3
2004-04,7


In [6]:
kpoptrend_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 210 entries, Month to 2021-05
Data columns (total 1 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Category: Arts & Entertainment  210 non-null    object
dtypes: object(1)
memory usage: 3.3+ KB


**Analysis: The data run from January 2004 to May 2021. The first row should be used as the column names instead.**
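
The header problem can also be avoided at load time. As a minimal sketch, assuming the file starts with a single category line followed by the real header row (as the `head()` output suggests), `skiprows=1` lets pandas pick up the proper header. The inline text and the name `trend_sample` are stand-ins for illustration, not the actual file.

```python
import io
import pandas as pd

# Inline stand-in for the trends file; rows copied from the head() output above.
raw_csv = io.StringIO(
    "Category: Arts & Entertainment\n"
    "Month,K-pop: (Worldwide)\n"
    "2004-01,6\n"
    "2004-02,5\n"
    "2004-03,3\n"
    "2004-04,7\n"
)

# skiprows=1 drops the category line so the real header row is used.
trend_sample = pd.read_csv(raw_csv, skiprows=1)

# Replace the exported headers with shorter names.
trend_sample.columns = ["year_month", "trend"]

print(trend_sample.dtypes)
print(trend_sample.head())
```

Read this way, the month becomes an ordinary column and the trend values are parsed as numbers rather than text.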

#### 2.3.2. Columns tidying and handling missing values (musicgroup_df)

In [7]:
# change all columns names to lowercase and replace spacing with underscore.
musicgroup_df.columns = musicgroup_df.columns.str.replace(' ', '_')
musicgroup_df.columns = musicgroup_df.columns.str.lower()

In [8]:
# Groups that are still active have no disband year; fill those cells with 0
# and store the column as integers.
musicgroup_df['disband_year'] = musicgroup_df['disband_year'].fillna(0).astype('int64')
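
Filling with 0 gives an integer column but also introduces a sentinel year that is not a real date. As an alternative sketch (not what this notebook does), pandas' nullable `Int64` dtype keeps the gap explicit while still storing whole years. The series below is a hand-made example using values from the `head()` output.

```python
import numpy as np
import pandas as pd

# Illustrative values: one active group (missing year) and two disbanded ones.
disband_year = pd.Series([np.nan, 2016.0, 2018.0], name="disband_year")

# Approach used in this notebook: 0 stands for "not disbanded".
filled = disband_year.fillna(0).astype("int64")

# Alternative: the nullable integer dtype keeps the missing entry as <NA>.
nullable = disband_year.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

With the sentinel approach, any later arithmetic on `disband_year` (for example a lifespan calculation) has to exclude the zeros first.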

#### 2.3.3. Check date ranges (musicgroup_df)

In [11]:
musicgroup_df.debut_year.unique()

array([2018, 2015, 2017, 2014, 2013, 2009, 2019, 2003, 2012, 2010, 2016,
       2006, 2020, 2001, 2011, 1997, 1999, 1998, 2021, 2005, 2007, 2000,
       2002, 1990, 2008, 1996, 1992, 2004, 1993, 1995], dtype=int64)

In [12]:
musicgroup_df.disband_year.unique()

array([   0, 2016, 2018, 2015, 2017, 2009, 2013, 2019, 2020, 2014, 2007,
       2002, 2011, 2021, 2006, 2005, 2001, 1999, 2000, 2012, 2003, 2004,
       1996, 2010, 1997], dtype=int64)

In [13]:
musicgroup_df.last_production_year.unique()

array([2021, 2016, 2018, 2014, 2015, 2017, 2020, 2009, 2012, 2019, 2006,
       2001, 2011, 2004, 2013, 2005, 2000, 1997, 2007, 2002, 1999, 1996,
       2010], dtype=int64)
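
Beyond listing the unique values, the year columns can be checked against each other. The sketch below assumes the cleaned column names and the convention that `disband_year == 0` means the group is still active. `check_df` is a hypothetical frame built from the `head()` rows, and the rule that a last production should not predate the debut is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical frame built from the first five rows shown in head().
check_df = pd.DataFrame({
    "group_name": ["(G)I-DLE", "10X10", "1NB", "1PS", "2EYES"],
    "debut_year": [2018, 2015, 2017, 2014, 2013],
    "disband_year": [0, 2016, 2018, 2015, 2017],
    "last_production_year": [2021, 2016, 2018, 2014, 2015],
})

# Rows where the group has actually disbanded (0 marks active groups).
disbanded = check_df["disband_year"] != 0

# A disbanded group should not have disbanded before it debuted.
bad_disband = check_df[
    disbanded & (check_df["disband_year"] < check_df["debut_year"])
]

# Assumed rule: the last production should not predate the debut.
bad_production = check_df[
    check_df["last_production_year"] < check_df["debut_year"]
]

print(len(bad_disband), len(bad_production))
```

Any rows returned in `bad_disband` or `bad_production` on the real data would be candidates for manual review.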

#### 2.3.4. Check company values (musicgroup_df)

In [14]:
# Normalise company names: replace spaces with underscores and lowercase them.
musicgroup_df.company = musicgroup_df.company.str.replace(' ', '_')
musicgroup_df.company = musicgroup_df.company.str.lower()

#### 2.3.5. Change column names (kpoptrend_df)

In [15]:
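# Use the first row ('K-pop: (Worldwide)') as the column name, then drop that row.
kpoptrend_df.columns = kpoptrend_df.iloc[0]
kpoptrend_df = kpoptrend_df[1:].copy()
# 'Month' was an index label in this frame (see the info() output above), not a
# column, so only the trend column is actually renamed here.
kpoptrend_df = kpoptrend_df.rename(columns = {'Month': 'year_month', 'K-pop: (Worldwide)': 'trend'})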
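
The `info()` output shows the trend column stored as `object`, so its values are still text after the renaming. A possible follow-up, sketched here on a hand-made frame (`trend_sample`, shaped like the cleaned data with the month strings in the index), is to convert the values to numbers and parse the month labels as dates. Naming the index `year_month` is an assumption based on the intended rename.

```python
import pandas as pd

# Illustrative frame shaped like the cleaned kpoptrend_df: month strings in the
# index and the trend stored as text (values copied from the head() output).
trend_sample = pd.DataFrame(
    {"trend": ["6", "5", "3", "7"]},
    index=["2004-01", "2004-02", "2004-03", "2004-04"],
)

# Convert the text values to numbers; anything unparseable becomes NaN.
trend_sample["trend"] = pd.to_numeric(trend_sample["trend"], errors="coerce")

# Parse the month labels into timestamps (first day of each month).
trend_sample.index = pd.to_datetime(trend_sample.index, format="%Y-%m")
trend_sample.index.name = "year_month"

print(trend_sample.dtypes)
print(trend_sample.head())
```

Using `errors="coerce"` means any non-numeric entry in the real file would surface as NaN instead of raising an error, so a missing-value check afterwards is worthwhile.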

### 2.4 Summary

**For musicgroup_df:**
* Column names have been changed to lowercase, with spaces replaced by underscores.
* Missing `disband_year` values, which belong to groups that are still active, have been filled with 0.
* The year columns have been checked for their value ranges.
* Company names have been changed to lowercase, with spaces replaced by underscores.

**For kpoptrend_df:**
* There are no missing values. The column names have been changed to make them more readable.


### 2.5 Exporting Data

In [16]:
# Export the cleaned data; comment these lines out to refrain from re-running the export
musicgroup_df.to_csv("../data/musicgroup_cleaned.csv", index = False)
kpoptrend_df.to_csv("../data/kpoptrend_cleaned.csv", index = True)