[Cascade Cup Data Science Hackathon](https://dphi.tech/practice/challenge/46#problem)
# Problem Overview
Understanding customers' intentions helps improve their journey, for example by offering shortcuts or recommendations that improve the overall experience. With advances in machine learning research, personalization of services has become very common. Typically, a user's intention on a website can be inferred from their past interactions. In concrete terms, a user leaves behind a sequence of events recording their page views and interactions. An event can be a search query, a visit to an article page, or a received e-mail. This data forms the basis for the techniques that follow, so the first step is to collect or extract it. Trell has already done this step.

In this world of big data, Trell wants you to predict the age group of its users based on their social media activity. This will help Trell segment its huge user base and cater differently to each segment. Given this dataset, predict the age group of each user. The evaluation metric for the competition is the weighted F1 score.

The machine learning model you develop will help Trell give its users a better experience by serving age-specific content that they are likely to find more relatable.
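To make the evaluation metric concrete, here is a minimal sketch of the weighted F1 score computed by hand: the per-class F1 scores are averaged with weights proportional to each class's number of true samples. The function name `weighted_f1` and the toy labels are illustrative assumptions, not part of the competition material.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true  # weight each class by its support
    return total / len(y_true)

# Toy labels (hypothetical), using the same 1-4 age-group codes
y_true = [1, 1, 1, 1, 2, 2, 3, 4]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3]
score = weighted_f1(y_true, y_pred)
print(f"Weighted F1: {score:.4f}")
```

Because the weights follow class frequency, a large class dominates the score; a class that is never predicted correctly (class 4 in the toy data) contributes zero but only in proportion to its support.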

In [1]:
import os
import numpy as np 
import pandas as pd 

SEED = 15
def all_seeds(s):
    os.environ['PYTHONHASHSEED'] = str(s)
    np.random.seed(s)
    
all_seeds(SEED)

# Data Description
## **Term:	Description**
1. **userId**:	Unique number given to each user.
2. **tier**:	Tier of the city in which the user is residing.
3. **gender**:	Categorical feature representing the gender of the user. 1 represents male and 2 represents female.
4. **following_rate**:	Number of accounts followed by the user (feature is normalized).
5. **followers_avg_age**:	Average of age groups of all the followers of the user.
6. **following_avg_age**:	Average of age groups of all the accounts followed by the user.
7. **max_repetitive_punc**:	Maximum number of repetitive punctuation marks found in the bio and comments of the user.
8. **num_of_hashtags_per_action**:	Average number of hashtags used by the user per comment.
9. **emoji_count_per_action**:	Average number of emojis used by the user per comment.
10. **punctuations_per_action**:	Average number of punctuations used by the user per comment.
11. **number_of_words_per_action**:	Average number of words used by the user per comment.
12. **avgCompletion**:	Average watch time completion rate of the videos.
13. **avgTimeSpent**:	Average time spent by the user on a video in seconds.
14. **avgDuration**:	Average duration of the videos that the user has watched till date.
15. **avgComments**:	Average number of comments per video watched.
16. **creations**:	Total number of videos uploaded by the user.
17. **content_views**:	Total number of videos watched.
18. **num_of_comments**:	Total number of comments made by the user (normalized)
19. **weekends_trails_watched_per_day**:	Number of videos watched on weekends per day.
20. **weekdays_trails_watched_per_day**:	Number of videos watched on weekdays per day.
21. **slot1_trails_watched_per_day**:	The day is divided into 4 slots. This feature represents the average number of videos watched in this particular time slot.
22. **slot2_trails_watched_per_day**:	The day is divided into 4 slots. This feature represents the average number of videos watched in this particular time slot.
23. **slot3_trails_watched_per_day**:	The day is divided into 4 slots. This feature represents the average number of videos watched in this particular time slot.
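24. **slot4_trails_watched_per_day**:	The day is divided into 4 slots. This feature represents the average number of videos watched in this particular time slot.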
25. **avgt2**:	Average number of followers of all the accounts followed by the user.
26. **age_group**:	This is a categorical feature denoting the age of the user. Age of users is divided into 4 groups, 1: <18y; 2: 18-24y; 3: 24-30y; 4: >30y
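As a quick illustration of the coded target described above, the sketch below decodes `age_group` into readable labels and computes the share of each class. The tiny DataFrame is a hypothetical stand-in for the training data, and the names `AGE_GROUP_LABELS`, `toy_df`, and `class_share` are assumptions made for this example.

```python
import pandas as pd

# Code-to-label mapping taken from the age_group description above
AGE_GROUP_LABELS = {1: "<18y", 2: "18-24y", 3: "24-30y", 4: ">30y"}

# Hypothetical toy rows standing in for the real training data
toy_df = pd.DataFrame({
    "userId": [101, 102, 103, 104, 105],
    "age_group": [1, 1, 2, 4, 1],
})

# Human-readable label for each coded age group
toy_df["age_label"] = toy_df["age_group"].map(AGE_GROUP_LABELS)

# Fraction of users in each age group, ordered by group code
class_share = toy_df["age_group"].value_counts(normalize=True).sort_index()
print(class_share)
```

Checking the class shares first is useful because the competition metric is support-weighted, so an imbalanced target changes how much each class matters.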

In [2]:
train_df = pd.read_csv('../input/casecade-cup-data-science-hackathon/train_age_dataset.csv')

#features
x_tr = train_df
x_tr = x_tr.drop(['age_group'],axis = 1)

#labels
y_tr = train_df[['age_group']]

test_df = pd.read_csv('../input/casecade-cup-data-science-hackathon/test_age_dataset.csv')

sample_df = pd.read_csv('../input/casecade-cup-data-science-hackathon/sample_submission.csv')

In [3]:
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
import xgboost as xgb 

N_FOLDS = 5
N_REPEAT = 2

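def training(n_repeat = 1, n_folds = 5):
    """Train one XGBoost model per fold; return the models and their mean weighted F1.

    n_repeat is only used to label the printed output.
    """
    models = []
    F1_scores = []
    # No random_state here: the shuffle draws from NumPy's global RNG
    # (seeded by all_seeds), so each repeat gets a different split.
    kfold = KFold(n_folds, shuffle = True)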
    
    for fold, (train_index, test_index) in enumerate(kfold.split(x_tr), 1):
        print('-'*85)
        print(f'Repeat {n_repeat}, Fold {fold}')
        
        X_train = x_tr.values[train_index]
        y_train = y_tr.values[train_index].ravel()
        X_test = x_tr.values[test_index]
        y_test = y_tr.values[test_index].ravel()
        
        model = xgb.XGBClassifier(n_estimators=300,
                                  booster='gbtree',
                                  eta=0.3,
                                  reg_lambda=3,
                                  max_depth=8,
                                  min_child_weight=10,
                                  random_state=8)
        model.fit(X_train, y_train)
        
        y_pred = model.predict(X_test)
        f1 = f1_score(y_test, y_pred, average="weighted")
        print(f'Weighted F1 score: {f1}')
        print(classification_report(y_test, y_pred, labels=[1, 2, 3, 4]))
        
        models.append(model)
        F1_scores.append(f1)
    return models, np.mean(F1_scores)

models = []
mean_f1s = []

for i in range(1, N_REPEAT+1):
    m, f = training(i, N_FOLDS)
    print('-'*85)
    models = models + m
    mean_f1s.append(f)

-------------------------------------------------------------------------------------
Repeat 1, Fold 1
Weighted F1 score: 0.8068331656363589
              precision    recall  f1-score   support

           1       0.93      0.94      0.94     61627
           2       0.64      0.78      0.70     11811
           3       0.57      0.66      0.61     12069
           4       0.65      0.35      0.46     12269

    accuracy                           0.81     97776
   macro avg       0.70      0.68      0.68     97776
weighted avg       0.81      0.81      0.81     97776

-------------------------------------------------------------------------------------
Repeat 1, Fold 2
Weighted F1 score: 0.8051483065378269
              precision    recall  f1-score   support

           1       0.93      0.94      0.94     61546
           2       0.64      0.78      0.70     11969
           3       0.56      0.65      0.60     11983
           4       0.63      0.35      0.45     12278

    accurac

In [4]:
print(f"Mean Weighted F1 score: {np.mean(mean_f1s)}")

Mean Weighted F1 score: 0.8056731453056012
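The training loop above passes the `age_group` codes 1 to 4 directly to `XGBClassifier`. The xgboost version used in this notebook is not stated; under the assumption that a more recent release is installed (newer versions expect class labels to start at 0 and may reject 1-based labels), a minimal workaround is to shift the labels before fitting and shift the predictions back afterwards. The arrays below are toy stand-ins, and `raw_pred` is a hypothetical model output.

```python
import numpy as np

# Toy labels standing in for y_tr (age_group codes 1-4)
y = np.array([1, 3, 2, 4, 1, 1])

# Shift to 0-based class indices before calling model.fit(X, y_zero_based)
y_zero_based = y - 1

# Hypothetical output of model.predict(...) in the 0-3 range
raw_pred = np.array([0, 2, 1, 3])

# Shift back to the original 1-4 age_group codes for reporting and submission
age_group_pred = raw_pred + 1
print(age_group_pred)
```

If the labels are shifted, the `labels=[1, 2, 3, 4]` argument of `classification_report` should be applied to the shifted-back predictions so the report still uses the original codes.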


In [5]:
from scipy import stats as s

# Predictions of every fold model on the test set: shape (n_models, n_samples)
all_preds = np.array([m.predict(test_df.values) for m in models])

# Majority vote across models for each test row
pred = np.asarray(s.mode(all_preds, axis=0)[0]).ravel()

test_pred = sample_df.copy()
test_pred['prediction'] = pred
test_pred = test_pred.astype(int)
test_pred.to_csv('submission.csv', index=False)
test_pred

Unnamed: 0,prediction
0,1
1,1
2,1
3,3
4,1
...,...
54315,1
54316,1
54317,4
54318,1


- Public LB: 80.86377553392155
- Private LB: 80.9536106219112