# **Predictive Analysis of BNK48 & CGM48's 16th Single General Election**

![GE4 General Election Banner](GE4banner.jpeg)

*Image Source: [BNK48 Official YouTube Video](https://www.youtube.com/watch?v=4EGrHyXIvf0)*

<a id='top'></a>

# Table of Contents
1. [Introduction to the 48 Group](#1)
2. [Understanding the General Election](#2)
3. [Data Collection Methodology](#3)
4. [Dataset Overview](#4)
5. [Exploratory Data Analysis](#5)
  - 5.1 [Data Preprocessing](#5.1)
6. [Predictive Model Building](#6)
  - 6.1 [`GE4_Rank` Regression Model](#6-1)
    - 6.1.1 [RandomForestRegressor](#6-1-1)
  - 6.2 [`GE4_Position` Classification Model](#6-2)
    - 6.2.1 [RandomForestClassifier](#6-2-1)
    - 6.2.2 [XGB Classifier](#6-2-2)
    - 6.2.3 [LightGBM Classifier](#6-2-3)
    - 6.2.4 [Emsemble](#6-2-4)
  - 6.3 [Hyperparameter Tuning (Optuna)](#6-3)
7. [Feature Importance Analysis](#7)
8. [Conclusion and Insights](#8)

<a id='1'></a>

## 1. Introduction to the 48 Group

Welcome to the predictive analysis project for the 16th Single General Election of BNK48 and CGM48, the Thai sister groups of the international 48 Group. These idol groups have revolutionized pop culture in Asia, capturing the hearts of fans through music, performances, and interactive fan events.

The 48 Group, which originated in Japan with AKB48, is known for its unique concept of "idols you can meet." The groups have localized sister groups in several countries, each with its own teams of performers. BNK48 and CGM48 are based in Thailand and have garnered a significant following through their engaging performances and public appearances.

<a id='2'></a>

## 2. Understanding the General Election

### General Election

The General Election is a hallmark event for 48 Group fans, where supporters vote for their favorite members. The outcome of this event is crucial as it determines the lineup for the next single and often influences the group's direction. Like the 3rd election, BNK48 and CGM48 utilized a blockchain-based token voting system within the iAM48 application for this 4th election, showcasing a blend of pop culture and cutting-edge technology. Preliminary results are announced to the fans at two separate events before the final results, adding to the excitement and speculation about the final rankings.

### Key Terms

- **Senbatsu**: The selected members who will perform the A-side of a single.
- **Coupling Song**: Additional tracks featured on a single, often performed by non-Senbatsu members.
- **Oshi**: A fan's favorite member, akin to a "most supported" idol.
- **Kami**: Derived from "Kamisama" meaning "God," referring to top-ranked idols.
- **Team**: Subgroups within BNK48 and CGM48, each with distinct identities and songs.
- **Center**: The lead position in a group's formation, often front and center during performances.

### Rank
- **Kami7**: The elite top seven idols as determined by fan votes in the general election. These members are considered the most popular and influential within the group.
- **Senbatsu**: The top 16 members selected to perform on the main track of a single, often considered the face of the group during promotions.
- **Under Girls**: Members who rank 17th to 32nd in the general election. They typically perform on the B-side of a single and are featured in secondary promotions.
- **Next Girls**: Ranks 33 to 48 from the election, these members are featured in additional songs and are recognized for their potential and growing popularity.

<a id='3'></a>

## 3. Data Collection Methodology

### Data Sources
The data for the analysis of the 3rd and 4th General Elections of BNK48 & CGM48 was meticulously gathered from a combination of online sources on 17 December 2023. The primary sources of data were:

- **Twitter Account [@Stats48TH](https://twitter.com/Stats48TH)**: An unofficial but valuable source providing comprehensive statistics and information on the members of BNK48 and CGM48.
- **Wikipedia Pages**: Detailed historical data and election results were extracted from Wikipedia for the:
  - [3rd General Election for the 12th Single](https://th.wikipedia.org/wiki/การเลือกตั้งทั่วไปเซ็มบัตสึบีเอ็นเคโฟร์ตีเอต_ประจำซิงเกิลที่_12)
  - [4th General Election for the 16th Single](https://th.wikipedia.org/wiki/การเลือกตั้งทั่วไปเซ็มบัตสึบีเอ็นเคโฟร์ตีเอต_ประจำซิงเกิลที่_16)
  
### Data Extraction
The data extraction process encompassed multiple steps to ensure the richness and accuracy of the dataset:

- **Reviewing Twitter Data**: An in-depth analysis of data summaries from the Twitter account was conducted to understand the nuances of the election results and member statistics.
- **Wikipedia Research**: The Wikipedia pages for the 3rd and 4th General Elections were scrutinized to validate and enrich the data obtained from Twitter.
- **Database Analysis**: The `Database members BNK48 & CGM48 (10_12_2023 21.45).xlsx`, provided by @Stats48TH, offered a granular view of the election outcomes and member profiles.
- **Power Query (Get Data)**: Power Query in Microsoft Excel was leveraged to perform data extraction, shaping, and transformation, facilitating the integration of various data sources.

### Data Processing and Model Creation
A comprehensive data processing workflow was adopted to refine the raw data and construct a robust data model:

- **OCR (Optical Character Recognition)**: OCR technology was utilized to digitize images and scanned documents, rendering them into an analyzable format.
- **Data Scraping**: Structured data was scraped from the web, transforming unstructured online information into a usable dataset.
- **Data Cleaning**: The dataset was meticulously cleaned to rectify any inaccuracies and standardize the format, thus ensuring the integrity of the analysis.
- **Data Model Creation in PowerPivot**: The data from various sources was combined using PowerPivot in Excel, which facilitated the creation of a data model that underscores the relationships between different entities. The ER Diagram provided a visual representation of this data model, showcasing how the `GE4rank` table was used as the cornerstone for 'Left Joins' with other tables based on the `Name_Band` attribute.

![ER_Diagram](ER_diagram.png)

### Final Dataset
The culmination of the above processes led to the assembly of the final dataset, which offers a holistic portrayal of the election results and member data. This dataset underpins the exploratory data analysis and predictive modeling executed in this project, aiming to extract meaningful insights and patterns.


<a id='4'></a>

## 4. Dataset Overview

This dataset includes detailed statistics from the latest General Election, including votes counted via blockchain technology, member rankings, and various other metrics that could influence the election outcomes.

| Column Name            | Description                                                   | Data Type | Missing | Example Values |
|------------------------|---------------------------------------------------------------|-----------|---------|----------------|
| GE4_Rank               | Rank in the 4th General Election                              | Integer   | 0       | 1, 2, 3 to 64  |
| Name                   | Member's nickname                                             | String    | 0       |                |
| Name_Band              | Combined name and band identifier                             | String    | 0       |                |
| GE4_Token              | Tokens earned in the 4th General Election                     | Float     | 0       |                |
| GE4_Transaction        | Number of transactions in the 4th General Election            | Integer   | 0       |                |
| GE4_Wallet             | Wallet balance during the 4th General Election                | Integer   | 0       |                |
| GE4_Token/Transaction  | Average tokens per transaction in the 4rd GE                  | Float     | 0       |                |
| GE4_Prelim1            | Preliminary round 1 rank in the 4th General Election          | Integer   | 16      |                |
| GE4_Token_Prelim1      | Tokens earned in the preliminary round 1 of the 4th GE        | Float     | 16      |                |
| GE4_Prelim2            | Preliminary round 2 rank in the 4th General Election          | Integer   | 16      |                |
| Band                   | Band name (BNK48 or CGM48)                                    | String    | 0       | BNK48, CGM48   |
| Gen                    | Generation of the member                                      | Integer   | 0       | 1, 2, 3, 4     |
| Band_Gen               | Combined band and generation identifier                       | String    | 0       | BNK48_1, CGM48_2 |
| Full_Name_TH           | Member's full name in Thai                                    | String    | 0       |                |
| Team_Position          | Position within the team                                      | String    | 57      | Captain, Vice Captain, Shihainin |
| Team                   | Team affiliation within the band                              | String    | 0       | BIII, NV, C, Trainee |
| Age                    | Age of the member at the time of 4th General Election         | Integer   | 0       | 20, 21         |
| GE1_Rank               | Rank in the 1st General Election                              | Integer   | 59      |                |
| GE2_Rank               | Rank in the 2nd General Election                              | Integer   | 46      |                |
| GE3_Rank               | Rank in the 3rd General Election                              | Integer   | 22      |                |
| GE3_Position           | Position in the 3rd General Election                          | String    | 23      | Senbatsu, Under Girls, Next Girls, Unranked |
| GE3_Prelim1            | Preliminary round 1 rank in the 3rd General Election          | Integer   | 32      |                |
| GE3_Token_Prelim1      | Tokens earned in the preliminary round 1 of the 3rd GE        | Float     | 32      |                |
| GE3_Prelim2            | Preliminary round 2 rank in the 3rd General Election          | Integer   | 34      |                |
| GE3_Token              | Tokens earned in the 3rd General Election                     | Float     | 34      |                |
| GE3_Transactions       | Number of transactions in the 3rd General Election            | Integer   | 23      |                |
| GE3_Wallet             | Wallet balance during the 3rd General Election                | Integer   | 23      |                |
| GE3_Token/Transaction  | Average tokens per transaction in the 3rd GE                  | Float     | 23      |                |
| GE3_Center             | Whether the member was the center in the 3rd GE (1 for center)| Integer   | 0       | 0, 1           |
| Request_Hour           | Total number of times participation in the Request Hour Concert | Integer | 0       | 1, 2, 3        |
| Game_Caster            | Total number of times participation in the 48TH Game Caster   | Integer   | 0       | 1, 2, 3        |
| Theater_Stage          | Number of times performed on the theater stage                | Integer   | 0       |                |
| iAM48_Kami             | Number of 'Kami' (only supporter) in the iAM48 platform       | Integer   | 0       |                || Column 
| iAM48_Oshi             | Number of 'Oshi' (supporter) in the iAM48 platform               | Integer   | 0       |                     |
| iAM48_Cookies          | Number of cookies received in the iAM48 platform                 | Integer   | 0       |                     |
| iAM48_Likes            | Number of likes received in the iAM48 platform                   | Integer   | 0       |                     |
| Setbatsu_Total         | Total number of times being in the Senbatsu lineup before 4th GE | Integer   | 0       |                     |
| Center_Main_Total      | Total number of times being the main center before 4th GE        | Integer   | 0       |                     |
| GE4_Position           | Position in the 4th General Election                             | String    | 0       | Senbatsu, Under Girls, Next Girls, Unranked |


In [1]:
import pandas as pd

In [2]:
df = pd.read_excel('BNK48_CGM48_df.xlsx')

In [3]:
# Set the option to display all columns (None means no limit to the number of columns)
pd.set_option('display.max_columns', None)

In [4]:
df

Unnamed: 0,GE4_Rank,Name,Name_Band,GE4_Token,GE4_Transaction,GE4_Wallet,GE4_Token/Transaction,GE4_Prelim1,GE4_Token_Prelim1,GE4_Prelim2,Band,Gen,Band_Gen,Full_Name_TH,Team_Position,Team,Age,GE1_Rank,GE2_Rank,GE3_Rank,GE3_Position,GE3_Prelim1,GE3_Token_Prelim1,GE3_Prelim2,GE3_Token,GE3_Transactions,GE3_Wallet,GE3_Token/Transaction,GE3_Center,Request_Hour,Game_Caster,Theater_Stage,iAM48_Kami,iAM48_Oshi,iAM48_Cookies,iAM48_Likes,Setbatsu_Total,Center_Main_Total,GE4_Position
0,1,Pim,Pim_CGM48,133891.5231,1023,570,130.881254,4.0,5590.69,1.0,CGM48,1,CGM48_1,พรวารินทร์ วงศ์ตระกูลกิจ,,C,17,,22.0,5.0,Senbatsu,5.0,5838.43,5.0,59433.96,513.0,129.0,115.86,False,2,6,0,918,12670,17831388,94093,9,1,Senbatsu
1,2,Paeyah,Paeyah_BNK48,110223.6325,841,459,131.062583,5.0,4159.38,5.0,BNK48,3,BNK48_3,นิพพิชฌาน์ พิพิธเดชา,,NV,18,,,24.0,Under Girls,26.0,1001.39,20.0,9752.89,1194.0,382.0,8.17,False,3,0,56,1657,16731,8486508,60637,4,1,Senbatsu
2,3,Kaning,Kaning_CGM48,99316.8750,967,526,102.706179,2.0,10419.31,2.0,CGM48,1,CGM48_1,วิทิตา สระศรีสม,,C,19,,40.0,6.0,Senbatsu,3.0,7309.99,6.0,53234.80,2369.0,929.0,22.47,False,3,27,0,3101,31554,27935669,183438,11,3,Senbatsu
3,4,Minmin,Minmin_BNK48,72812.5690,557,281,130.722745,3.0,6017.28,8.0,BNK48,2,BNK48_2,รชยา ทัพพ์คุณานนต์,,BIII,26,19.0,14.0,10.0,Senbatsu,9.0,5102.41,8.0,36102.21,1166.0,393.0,30.06,False,5,2,148,1404,45844,26256835,182763,9,0,Senbatsu
4,5,Pancake,Pancake_BNK48,52589.7200,484,266,108.656446,28.0,1069.53,21.0,BNK48,3,BNK48_3,พิทยาภรณ์ เกียรติฐิตินันท์,,NV,16,,,43.0,Next Girls,33.0,854.55,38.0,3667.27,737.0,235.0,4.98,False,3,5,43,714,13133,10090840,63289,3,1,Senbatsu
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,60,Berry,Berry_BNK48,920.9499,139,104,6.625539,,,,BNK48,4,BNK48_4,จิรภิญญา จันทวรรณกูร,,Trainee,18,,,,,,,,,,,,False,0,0,0,83,2901,1945863,16496,0,0,Unranked
60,61,Wawa,Wawa_BNK48,669.7650,136,112,4.924743,,,,BNK48,4,BNK48_4,พิมพ์นเรศ ลำใย,,Trainee,14,,,,,,,,,,,,False,0,0,0,126,2870,592968,25511,0,0,Unranked
61,62,Papang,Papang_CGM48,494.7850,215,158,2.301326,,,,CGM48,2,CGM48_2,ศุภัชญา คำเงิน,,Trainee,16,,,,,,,,,,,,False,0,0,0,149,3385,1436223,10568,0,0,Unranked
62,63,Emma,Emma_CGM48,275.6710,182,147,1.514676,,,,CGM48,2,CGM48_2,ศศิชา วงศ์วัฒนอนันต์,,Trainee,19,,,,,,,,,,,,False,0,0,0,62,2421,581191,11824,0,0,Unranked


In [5]:
#df.to_csv('BNK48_CGM48_df.csv', index=False)

<a id='5'></a>

## 5. Exploratory Data Analysis

In [6]:
import sweetviz as sv

# Analyzing the dataset
#report = sv.analyze(df)

# Generating the report
#report.show_notebook()

In [7]:
# Getting a summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   GE4_Rank               64 non-null     int64  
 1   Name                   64 non-null     object 
 2   Name_Band              64 non-null     object 
 3   GE4_Token              64 non-null     float64
 4   GE4_Transaction        64 non-null     int64  
 5   GE4_Wallet             64 non-null     int64  
 6   GE4_Token/Transaction  64 non-null     float64
 7   GE4_Prelim1            48 non-null     float64
 8   GE4_Token_Prelim1      48 non-null     float64
 9   GE4_Prelim2            48 non-null     float64
 10  Band                   64 non-null     object 
 11  Gen                    64 non-null     int64  
 12  Band_Gen               64 non-null     object 
 13  Full_Name_TH           64 non-null     object 
 14  Team_Position          7 non-null      object 
 15  Team    

In [8]:
# Descriptive statistics of the dataframe
df.describe()

Unnamed: 0,GE4_Rank,GE4_Token,GE4_Transaction,GE4_Wallet,GE4_Token/Transaction,GE4_Prelim1,GE4_Token_Prelim1,GE4_Prelim2,Gen,Age,GE1_Rank,GE2_Rank,GE3_Rank,GE3_Prelim1,GE3_Token_Prelim1,GE3_Prelim2,GE3_Token,GE3_Transactions,GE3_Wallet,GE3_Token/Transaction,Request_Hour,Game_Caster,Theater_Stage,iAM48_Kami,iAM48_Oshi,iAM48_Cookies,iAM48_Likes,Setbatsu_Total,Center_Main_Total
count,64.0,64.0,64.0,64.0,64.0,48.0,48.0,48.0,64.0,64.0,5.0,18.0,42.0,32.0,32.0,30.0,41.0,41.0,41.0,41.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0
mean,32.5,19369.981492,406.390625,236.296875,38.888322,24.5,2082.308958,24.5,2.25,19.8125,18.4,31.166667,35.928571,27.28125,1844.875937,27.466667,9637.234878,844.95122,268.146341,10.878537,2.09375,7.453125,32.421875,1002.046875,17805.96875,9757608.0,91762.609375,3.40625,0.25
std,18.618987,25935.246469,246.527864,119.313736,33.622831,14.0,2273.912177,14.0,1.112697,3.318419,3.577709,11.617684,16.60942,13.006783,1870.861349,13.299555,12764.762065,465.911792,183.300377,17.662012,1.610666,7.700607,48.688125,1624.024429,17443.28327,7494040.0,74172.456294,3.499291,0.590937
min,1.0,209.76,127.0,104.0,1.514676,1.0,352.64,1.0,1.0,14.0,13.0,8.0,5.0,3.0,360.16,5.0,298.71,247.0,39.0,0.91,0.0,0.0,0.0,34.0,1898.0,509091.0,7869.0,0.0,0.0
25%,16.75,5372.64425,247.25,156.5,18.039122,12.75,858.705,12.75,1.0,17.0,17.0,22.75,22.25,18.25,622.935,18.25,3108.14,474.0,129.0,5.21,0.0,0.0,0.0,213.75,5641.5,3964677.0,40691.75,0.0,0.0
50%,32.5,8443.2865,359.5,208.0,28.31218,24.5,1294.235,24.5,2.0,20.0,19.0,32.5,37.5,28.0,961.02,29.5,5008.46,711.0,230.0,7.11,2.0,5.5,0.0,515.5,11151.0,8526086.0,66718.0,2.5,0.0
75%,48.25,24685.51025,434.0,266.0,44.55656,36.25,2535.3475,36.25,3.0,21.25,21.0,39.75,48.75,37.5,2007.8625,37.75,9877.98,1130.0,391.0,10.82,3.0,13.0,52.75,1085.75,24190.25,13368090.0,131324.75,5.25,0.0
max,64.0,133891.5231,1149.0,581.0,131.062583,48.0,11375.59,48.0,4.0,28.0,22.0,48.0,62.0,48.0,7309.99,48.0,59433.96,2369.0,929.0,115.86,6.0,27.0,174.0,11315.0,93145.0,27935670.0,380536.0,12.0,3.0


<a id='5-1'></a>

### 5.1 Data Preprocessing

#### Dropping Columns
- **Identifier Columns:** `Name`, `Name_Band`, `Full_Name_TH`
- **Redundant or Less Significant Columns:** `Gen`, `GE3_Position`, `GE3_Prelim1`, `GE3_Token_Prelim1`, `GE3_Prelim2`

#### Feature Engineering
- **Band_Team:** Concatenated `Band` and `Team` into a new column `Band_Team`, then dropped `Team`.
- **GE1_Par & GE2_Par:** Converted `GE1_Rank` and `GE2_Rank` into binary participation columns (`GE1_Par`, `GE2_Par`) and dropped the original columns.
- **GE3_Center & Team_Position:** Converted to binary format.

#### Missing Data Imputation
- Imputed `GE3_Rank`, `GE4_Prelim1`, and `GE4_Prelim2` with max rank + 1 (max rank assumed to be 48).
- Imputed `GE4_Token_Prelim1`, `GE3_Token`, `GE3_Transactions`, `GE3_Wallet`, `GE3_Token/Transaction` with their respective minimum values.

#### Scaling
- Applied `StandardScaler` to the following columns:
  - `GE4_Token`, `GE4_Transaction`, `GE4_Wallet`, `GE4_Token/Transaction`, `GE4_Token_Prelim1`
  - `Age`
  - `GE3_Token`, `GE3_Transactions`, `GE3_Wallet`, `GE3_Token/Transaction`
  - `Request_Hour`, `Game_Caster`, `Theater_Stage`
  - `iAM48_Kami`, `iAM48_Oshi`, `iAM48_Cookies`, `iAM48_Likes`
  - `Setbatsu_Total`, `Center_Main_Total`

#### Data Type Conversion
- Converted `GE4_Prelim1`, `GE4_Prelim2`, and `GE3_Rank` to integer type.

In [9]:
# Dropping identifier columns
df = df.drop(['Name', 'Name_Band', 'Full_Name_TH'], axis=1)

# Dropping columns where significance is included in other columns
df = df.drop(['Gen', 'GE3_Position', 'GE3_Prelim1', 'GE3_Token_Prelim1', 'GE3_Prelim2'], axis=1)

# Concatenating 'Band' and 'Team' into a new 'Band_Team' column
df['Band_Team'] = df['Band'] + "_" + df['Team']
df = df.drop(['Team'], axis=1)

# Converting GE1_Rank and GE2_Rank to binary participation columns
df['GE1_Par'] = df['GE1_Rank'].notna().astype(int)
df['GE2_Par'] = df['GE2_Rank'].notna().astype(int)
df = df.drop(['GE1_Rank', 'GE2_Rank'], axis=1)

# Imputing missing data for specific columns
max_rank = 48  # max rank is 48
df['GE3_Rank'].fillna(max_rank + 1, inplace=True)
df['GE4_Prelim1'].fillna(max_rank + 1, inplace=True)
df['GE4_Prelim2'].fillna(max_rank + 1, inplace=True)
df['GE4_Token_Prelim1'].fillna(df['GE4_Token_Prelim1'].min(), inplace=True)
df['GE3_Token'].fillna(df['GE3_Token'].min(), inplace=True)
df['GE3_Transactions'].fillna(df['GE3_Transactions'].min(), inplace=True)
df['GE3_Wallet'].fillna(df['GE3_Wallet'].min(), inplace=True)
df['GE3_Token/Transaction'].fillna(df['GE3_Token/Transaction'].min(), inplace=True)

# Converting GE3_Center and Team_Position to binary
df['GE3_Center'] = df['GE3_Center'].notna().astype(int)
df['Team_Position'] = df['Team_Position'].notna().astype(int)

# Changing data types to integer
df['GE4_Prelim1'] = df['GE4_Prelim1'].astype(int)
df['GE4_Prelim2'] = df['GE4_Prelim2'].astype(int)
df['GE3_Rank'] = df['GE3_Rank'].astype(int)

from sklearn.preprocessing import StandardScaler

# Columns to preprocess with StandardScaler
columns_to_scale = [
    'GE4_Token', 'GE4_Transaction', 'GE4_Wallet', 'GE4_Token/Transaction', 'GE4_Token_Prelim1',
    'Age',
    'GE3_Token', 'GE3_Transactions', 'GE3_Wallet', 'GE3_Token/Transaction',
    'Request_Hour', 'Game_Caster', 'Theater_Stage',
    'iAM48_Kami', 'iAM48_Oshi','iAM48_Cookies', 'iAM48_Likes',
    'Setbatsu_Total', 'Center_Main_Total'
]

# Applying StandardScaler to specified columns
scaler = StandardScaler()
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

In [10]:
df

Unnamed: 0,GE4_Rank,GE4_Token,GE4_Transaction,GE4_Wallet,GE4_Token/Transaction,GE4_Prelim1,GE4_Token_Prelim1,GE4_Prelim2,Band,Band_Gen,Team_Position,Age,GE3_Rank,GE3_Token,GE3_Transactions,GE3_Wallet,GE3_Token/Transaction,GE3_Center,Request_Hour,Game_Caster,Theater_Stage,iAM48_Kami,iAM48_Oshi,iAM48_Cookies,iAM48_Likes,Setbatsu_Total,Center_Main_Total,GE4_Position,Band_Team,GE1_Par,GE2_Par
0,1,4.450579,2.520948,2.818964,2.757655,4,1.887696,1,CGM48,CGM48_1,0,-0.854242,5,4.813914,-0.250728,-0.312240,7.355500,1,-0.058666,-0.190194,-0.671173,-0.052161,-0.296766,1.085877,0.031667,1.611175,1.279204,Senbatsu,CGM48_C,0,1
1,2,3.530789,1.776858,1.881289,2.763090,5,1.202079,5,BNK48,BNK48_3,0,-0.550512,24,0.314422,1.207858,1.078625,0.059210,1,0.567103,-0.975513,0.488097,0.406478,-0.062114,-0.170956,-0.422956,0.171019,1.279204,Senbatsu,BNK48_NV,0,0
2,3,3.106926,2.291997,2.447273,1.913056,2,4.200672,2,CGM48,CGM48_1,0,-0.246781,6,4.252471,3.724507,4.085752,1.028073,1,0.567103,2.558421,-0.671173,1.302656,0.794387,2.444845,1.245747,2.187237,4.690416,Senbatsu,CGM48_C,0,1
3,4,2.076906,0.615752,0.377631,2.752903,3,2.092039,8,BNK48,BNK48_2,0,1.879333,10,2.700815,1.147886,1.139097,1.542316,1,1.818642,-0.713740,2.392612,0.249461,1.620089,2.219051,1.236575,1.611175,-0.426401,Senbatsu,BNK48_BIII,1,1
4,5,1.290998,0.317298,0.250918,2.091426,28,-0.278001,21,BNK48,BNK48_3,0,-1.157973,43,-0.236737,0.229042,0.270494,-0.156922,1,0.567103,-0.321081,0.218980,-0.178768,-0.270013,0.044818,-0.386919,-0.117013,1.279204,Senbatsu,BNK48_NV,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,60,-0.716973,-1.093201,-1.117581,-0.967135,49,-0.621401,49,BNK48,BNK48_4,0,-0.550512,49,-0.541820,-0.820454,-0.807014,-0.432675,1,-1.310204,-0.975513,-0.671173,-0.570381,-0.861237,-1.050635,-1.022773,-0.981106,-0.426401,Unranked,BNK48_Trainee,0,0
60,61,-0.726735,-1.105466,-1.050000,-1.018120,49,-0.621401,49,BNK48,BNK48_4,0,-1.765434,49,-0.541820,-0.820454,-0.807014,-0.432675,1,-1.310204,-0.975513,-0.671173,-0.543694,-0.863028,-1.232591,-0.900271,-0.981106,-0.426401,Unranked,BNK48_Trainee,0,0
61,62,-0.733535,-0.782482,-0.661414,-1.096761,49,-0.621401,49,CGM48,CGM48_2,0,-1.157973,49,-0.541820,-0.820454,-0.807014,-0.432675,1,-1.310204,-0.975513,-0.671173,-0.529420,-0.833270,-1.119178,-1.103327,-0.981106,-0.426401,Unranked,CGM48_Trainee,0,0
62,63,-0.742050,-0.917399,-0.754337,-1.120343,49,-0.621401,49,CGM48,CGM48_2,0,-0.246781,49,-0.541820,-0.820454,-0.807014,-0.432675,1,-1.310204,-0.975513,-0.671173,-0.583414,-0.888972,-1.234175,-1.086260,-0.981106,-0.426401,Unranked,CGM48_Trainee,0,0


### Column Types Post-Preprocessing
- **Continuous Variables:** `GE4_Token`, `GE4_Transaction`, `GE4_Wallet`, `GE4_Token/Transaction`, `GE4_Token_Prelim1`, `GE3_Token`, `GE3_Transactions`, `GE3_Wallet`, `GE3_Token/Transaction`, `Request_Hour`, `Game_Caster`, `Theater_Stage`, `iAM48_Kami`, `iAM48_Oshi`, `iAM48_Cookies`, `iAM48_Likes`, `Setbatsu_Total`, `Center_Main_Total`, `Age`
- **Binary Variables:** `GE1_Par`, `GE2_Par`, `GE3_Center`, `Team_Position`
- **Categorical Variables:** `Band`, `Band_Gen`, `Band_Team` (derived from original `Band` and `Team`)
- **Original Variables:** `GE4_Rank`, `GE4_Prelim1`, `GE4_Prelim2`, and `GE3_Rank`

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   GE4_Rank               64 non-null     int64  
 1   GE4_Token              64 non-null     float64
 2   GE4_Transaction        64 non-null     float64
 3   GE4_Wallet             64 non-null     float64
 4   GE4_Token/Transaction  64 non-null     float64
 5   GE4_Prelim1            64 non-null     int64  
 6   GE4_Token_Prelim1      64 non-null     float64
 7   GE4_Prelim2            64 non-null     int64  
 8   Band                   64 non-null     object 
 9   Band_Gen               64 non-null     object 
 10  Team_Position          64 non-null     int64  
 11  Age                    64 non-null     float64
 12  GE3_Rank               64 non-null     int64  
 13  GE3_Token              64 non-null     float64
 14  GE3_Transactions       64 non-null     float64
 15  GE3_Wall

<a id='6'></a>

## 6. Predictive Model Building

In [None]:
# Re-importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Separating features and target variables for both tasks
X = df.drop(['GE4_Rank', 'GE4_Position'], axis=1)
y_rank = df['GE4_Rank']  # For regression
y_position = df['GE4_Position']  # For classification

# Handling missing values and encoding categorical variables
# Identifying categorical columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns

# Creating preprocessing pipelines
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining preprocessing steps into a single transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ])

# Applying preprocessing to feature data
X_preprocessed = preprocessor.fit_transform(X)

# Splitting data into training and testing sets for both tasks
X_train_rank, X_test_rank, y_train_rank, y_test_rank = train_test_split(X_preprocessed, y_rank, test_size=0.2, random_state=42)
X_train_position, X_test_position, y_train_position, y_test_position = train_test_split(X_preprocessed, y_position, test_size=0.2, random_state=42)

# Building and training the models
# Regression model for GE4_Rank
model_rank = RandomForestRegressor(random_state=42)
model_rank.fit(X_train_rank, y_train_rank)

# Classification model for GE4_Position
model_position = RandomForestClassifier(random_state=42)
model_position.fit(X_train_position, y_train_position)

# Model building and training are complete
"Models are built and trained."

'Models are built and trained.'

In [None]:
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
import numpy as np

# Predicting and evaluating the regression model (GE4_Rank)
y_pred_rank = model_rank.predict(X_test_rank)
rmse_rank = np.sqrt(mean_squared_error(y_test_rank, y_pred_rank))

# Predicting and evaluating the classification model (GE4_Position)
y_pred_position = model_position.predict(X_test_position)
accuracy_position = accuracy_score(y_test_position, y_pred_position)
classification_report_position = classification_report(y_test_position, y_pred_position)

# Results
rmse_rank, accuracy_position, classification_report_position

(1.5791818719246316,
 0.8461538461538461,
 '              precision    recall  f1-score   support\n\n  Next Girls       0.75      1.00      0.86         3\n    Senbatsu       0.80      1.00      0.89         4\n Under Girls       1.00      0.50      0.67         2\n    Unranked       1.00      0.75      0.86         4\n\n    accuracy                           0.85        13\n   macro avg       0.89      0.81      0.82        13\nweighted avg       0.88      0.85      0.84        13\n')

In [None]:
# Getting feature names after one-hot encoding
feature_names = list(preprocessor.transformers_[1][1]['onehot'].get_feature_names(categorical_cols))
feature_names.extend(numerical_cols)

# Extracting feature importances from both models
importances_rank = model_rank.feature_importances_
importances_position = model_position.feature_importances_

# Creating a DataFrame for better visualization
importances_rank_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances_rank})
importances_position_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances_position})

# Sorting the DataFrame based on importance
importances_rank_df = importances_rank_df.sort_values(by='Importance', ascending=False)
importances_position_df = importances_position_df.sort_values(by='Importance', ascending=False)

importances_rank_df.head(), importances_position_df.head()

AttributeError: 'OneHotEncoder' object has no attribute 'get_feature_names'

<a id='6-1'></a>

### 6.1 `GE4_Rank` Regression Model

<a id='6-1-1'></a>

#### 6.1.1 Regression Analysis

<a id='6-2'></a>

### 6.2 `Position` Classification Model

<a id='6-2-1'></a>

#### 6.2.2 Logistic Regression


<a id='6-2-2'></a>

#### 6.2.2 XGBoost


<a id='6-2-3'></a>

#### 6.2.3 LightGBM

<a id='6-2-4'></a>

#### 6.2.4 Ensemble

<a id='6-3'></a>

### 6.3 Hyperparameter Tuning (Optuna)

<a id='7'></a>

## 7. Feature Importance Analysis

<a id='8'></a>

## 8. Conclusion and Insights