# <center>Predicting article popularity on the Mashable data</center>
    
<center>Created by Zsófia Rebeka Katona</center>
<center>Data Science 2 - Kaggle competition</center>


## Introduction
---

The goal of this challenge is to predict which articles are shared the most on social media. The data comes from the website mashable.com as of the beginning of 2015. The dataset used in the competition can be found in the UCI repository.

- You will find the training and test data in the data section of the competition, along with a description of the features.
- You will need to build models on the training data, make predictions on the test data, and submit your solutions to Kaggle. You will also find a sample solution file in the data section that shows the format you will need to use for your own submissions.
- The deadline for Kaggle solutions is 8PM on 19 April. You will be graded primarily on the basis of your work and how clearly you explain your methods and results. Those in the top three in the competition will receive some extra points. I expect you to experiment with all the methods we have covered: linear models, random forest, gradient boosting, neural networks + parameter tuning, feature engineering.
- You will see the public score of your best model on the leaderboard. A private dataset will be used to evaluate the final performance of your model to avoid overfitting based on the leaderboard.
- You should also submit to Moodle the documentation (ipynb and pdf) of your work, including exploratory data analysis, data cleaning, parameter tuning and evaluation. Aim for concise explanations.
- Feel free to ask questions about the task in Slack. The Kaggle competition is already open, so please start working on it and submitting solutions (you cannot submit more than 5 solutions per day).

## Data import
---

In [3]:
# Importing required libraries
import os
import pandas as pd
import numpy as np

In [6]:
# Importing the training and the test set
current_dir = os.getcwd()
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Checking the attributes of the sets
print(f"The shape of the train set is: {train_df.shape}.")
print(f"The shape of the test set is {test_df.shape}.")
print("The data types of the train set:")
train_df.info()

The shape of the train set is: (29733, 61).
The shape of the test set is (9911, 60).
The data types of the train set:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29733 entries, 0 to 29732
Data columns (total 61 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   timedelta                      29733 non-null  int64  
 1   n_tokens_title                 29733 non-null  int64  
 2   n_tokens_content               29733 non-null  int64  
 3   n_unique_tokens                29733 non-null  float64
 4   n_non_stop_words               29733 non-null  float64
 5   n_non_stop_unique_tokens       29733 non-null  float64
 6   num_hrefs                      29733 non-null  int64  
 7   num_self_hrefs                 29733 non-null  int64  
 8   num_imgs                       29733 non-null  int64  
 9   num_videos                     29733 non-null  int64  
 10  average_token_length           29733 non-null  f

In [7]:
train_df.head(10)

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,is_popular,article_id
0,594,9,702,0.454545,1.0,0.620438,11,2,1,0,...,1.0,-0.153395,-0.4,-0.1,0.0,0.0,0.5,0.0,0,1
1,346,8,1197,0.470143,1.0,0.666209,21,6,2,13,...,1.0,-0.308167,-1.0,-0.1,0.0,0.0,0.5,0.0,0,3
2,484,9,214,0.61809,1.0,0.748092,5,2,1,0,...,0.433333,-0.141667,-0.2,-0.05,0.0,0.0,0.5,0.0,0,5
3,639,8,249,0.621951,1.0,0.66474,16,5,8,0,...,0.5,-0.5,-0.8,-0.4,0.0,0.0,0.5,0.0,0,6
4,177,12,1219,0.397841,1.0,0.583578,21,1,1,2,...,0.8,-0.441111,-1.0,-0.05,0.0,0.0,0.5,0.0,0,7
5,568,7,126,0.723577,1.0,0.774194,3,3,1,0,...,0.285714,0.0,0.0,0.0,0.454545,0.136364,0.045455,0.136364,0,8
6,318,12,1422,0.367994,1.0,0.469256,28,28,26,0,...,0.7,-0.234167,-0.5,-0.05,1.0,0.1,0.5,0.1,0,9
7,582,6,1102,0.451287,1.0,0.642089,7,3,1,0,...,0.8,-0.15163,-0.4,-0.05,0.8,0.4,0.3,0.4,1,11
8,269,9,0,0.0,0.0,0.0,0,0,5,0,...,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.5,0,12
9,567,7,94,0.755319,1.0,0.8125,8,6,0,11,...,1.0,-0.183333,-0.2,-0.166667,0.0,0.0,0.5,0.0,0,14


#### Train-test split

In [None]:
from sklearn.model_selection import train_test_split

# Dropping the target variable
features = train_df.drop(columns=["is_popular"])
label = train_df["is_popular"]

# Setting the random state
prng = np.random.RandomState(20240419)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=prng)

#### Feature engineering

In [None]:
# Creating the feature-engineered dataset
# ASSUMPTION: `timedelta` counts the days between publication and data collection.
# The reference date is a placeholder ("beginning of 2015"), not a documented value.
REFERENCE_DATE = pd.Timestamp("2015-01-01")

def extract_dt_features(df_with_timedelta):
    # Approximating the publication date from the integer `timedelta` column
    pub_date = REFERENCE_DATE - pd.to_timedelta(df_with_timedelta['timedelta'], unit='D')
    df_with_timedelta['month'] = pub_date.dt.month
    df_with_timedelta['day'] = pub_date.dt.day

# Adding the total number of media elements in each post (links, videos, images)
train_df['total_multimedia'] = train_df['num_hrefs'] + train_df['num_self_hrefs'] + train_df['num_imgs'] + train_df['num_videos']

# Extracting the features
extract_dt_features(train_df)

# Dropping the target variable and keeping the numeric columns
feature_matrix = train_df.drop(columns=["is_popular"]).select_dtypes(include=np.number)

# Labelling the target column
label = train_df["is_popular"]

# Reusing the seed of the first split so that both splits select the same rows
prng = np.random.RandomState(20240419)
# Splitting the feature-engineered data into a training set and a test set again
X_train_fe, X_test_fe, y_train, y_test = train_test_split(feature_matrix, label, test_size=0.2, random_state=prng)
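
The second split overwrites `y_train` and `y_test`, so the raw and the feature-engineered matrices only stay aligned with those labels when both splits pick the same rows. The sketch below illustrates this on toy data; the frames `toy` and `toy_fe` and the helper `split_index` are hypothetical names introduced for the illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the raw and the feature-engineered data (hypothetical values)
toy = pd.DataFrame({"x": np.arange(100), "is_popular": np.arange(100) % 2})
toy_fe = toy.assign(x_squared=toy["x"] ** 2)

def split_index(frame, seed):
    """Return the training-row index chosen by an 80/20 split with a fresh seed."""
    prng = np.random.RandomState(seed)
    X_tr, _, _, _ = train_test_split(
        frame.drop(columns=["is_popular"]), frame["is_popular"],
        test_size=0.2, random_state=prng,
    )
    return X_tr.index

# Same seed: the two splits select identical rows in identical order
same_rows = split_index(toy, 20240419).equals(split_index(toy_fe, 20240419))
# Different seeds: the selected rows no longer match
different_rows = split_index(toy, 20240419).equals(split_index(toy_fe, 20240306))
print(same_rows, different_rows)
```

A fresh `RandomState` is created for every call because a shared instance advances its internal state each time it is used.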

## Data cleaning
---

## Exploratory Data Analysis
---

## Predictive models
---

### Model 1: Linear models (OLS)
Using 6 different models

### Model 2: Linear models (Lasso)
Using the same 6 different models
Logit + lasso with CV
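
A minimal sketch of what "logit + lasso with CV" could look like in scikit-learn, assuming an L1-penalised logistic regression whose penalty strength is chosen by 5-fold cross-validation. The synthetic data from `make_classification` stands in for the article features, and the grid size `Cs=5` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (a stand-in, not the competition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scaling first, because the L1 penalty is sensitive to feature scale
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=5, cv=5, penalty="l1", solver="liblinear", random_state=0
    ),
)
lasso_logit.fit(X_demo, y_demo)

# Coefficients shrunk exactly to zero correspond to dropped features
coefs = lasso_logit[-1].coef_.ravel()
n_kept = int(np.sum(coefs != 0))
print(f"Features kept by the L1 penalty: {n_kept} of {coefs.size}")
```

Putting the scaler inside the pipeline keeps the whole model in one object, so the same preprocessing is applied at prediction time.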

### Model 3: Decision Trees
and feature-engineered decision trees

### Model 4: RandomForest
with cross-validation,
and/or a feature-engineered RandomForest
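
One possible way to combine a random forest with cross-validation is a small grid search, sketched below on synthetic data. The grid values, the number of trees, and the ROC AUC scoring are assumptions made for the illustration; the competition metric is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data (a stand-in, not the competition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Hypothetical tuning grid: tree depth and features considered per split
param_grid = {"max_depth": [3, None], "max_features": ["sqrt", 0.5]}

rf_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="roc_auc",  # assumed metric
)
rf_search.fit(X_demo, y_demo)

print(rf_search.best_params_, round(rf_search.best_score_, 3))
```

The same search can be run on the feature-engineered matrix to check whether the added columns improve the cross-validated score.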

### Model 5: Gradient Boosting
with cross-validation, and/or a feature-engineered GradientBoosting
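
As a rough illustration of cross-validated gradient boosting, the sketch below compares a few learning rates with `cross_val_score` on synthetic data. The candidate rates, the tree settings, and the ROC AUC scoring are assumptions, not values taken from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data (a stand-in, not the competition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Mean 3-fold cross-validated score for each candidate learning rate
cv_scores = {}
for learning_rate in (0.05, 0.1, 0.3):
    model = GradientBoostingClassifier(
        n_estimators=50, max_depth=3,
        learning_rate=learning_rate, random_state=0,
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="roc_auc")
    cv_scores[learning_rate] = scores.mean()

# Picking the learning rate with the highest mean score
best_rate = max(cv_scores, key=cv_scores.get)
print(best_rate, round(cv_scores[best_rate], 3))
```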

## Neural networks
---

### Model 6
Simple fully connected layer network with dropout

### Model 7
Convolutional neural network with dropout and increased width

### Model 8
Convolutional neural network with dropout, increased width and increased depth

## Evaluation
---