**Task**:   
Help Trell predict the age group of its users from their social media activity by building a classification model and evaluating its performance.    

I use **logistic regression** with a **one-vs-rest multi-class** classification scheme, and the **weighted F1 score** to evaluate the model.
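As a rough illustration of what the weighted F1 score measures, the sketch below computes it by hand and with scikit-learn's `f1_score`. The labels are invented for the example and are not Trell data; `zero_division=0` is an assumption made so that a class that is never predicted scores 0 instead of raising a warning.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration (not Trell data)
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 4])
y_pred = np.array([1, 1, 1, 2, 2, 2, 3, 1])

# Per-class F1, one value per entry in `labels`
labels = np.array([1, 2, 3, 4])
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)

# Support = number of true samples in each class
support = np.array([(y_true == label).sum() for label in labels])

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual_weighted_f1 = float((per_class_f1 * support).sum() / support.sum())
sklearn_weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(manual_weighted_f1, sklearn_weighted_f1)
```

Because each class contributes in proportion to its support, a dominant class pulls the weighted score toward its own F1.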

In [None]:
# Python 3 environment with analytics libraries installed
# as defined by the kaggle/python Docker image

import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#assigning CSV files as pandas dataframes
sample = pd.read_csv('../input/trell-social-media-usage-data/sample_submission.csv')
testDF = pd.read_csv('../input/trell-social-media-usage-data/test_age_dataset.csv')
train = pd.read_csv('../input/trell-social-media-usage-data/train_age_dataset.csv')

In [None]:
#The dataframe with the bulk of the data
##Use this to train a prediction model
train

In [None]:
#use this dataset to test model, make predictions, 
#and compare to the train dataset prediction
testDF

First: explore and evaluate the datasets by checking for NaNs, getting a description of the data, and exploring the age_group column in the train dataset.   

In [None]:
#making sure the data contains no NaNs
train.isnull().values.any()

In [None]:
testDF.isnull().values.any()

Neither dataset had NaNs.       
The classification labels are numeric.   
The train dataset includes the age group classification. The test does not. This means that the datasets are ready for use.
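The single True/False check above can be complemented with a per-column count of missing values. This is a minimal sketch on a toy frame; the column names `feature_a` and `feature_b` are hypothetical and do not reflect the Trell schema.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names (not the Trell schema)
toy = pd.DataFrame({
    "feature_a": [0.1, np.nan, 0.3],
    "feature_b": [1.0, 2.0, 3.0],
    "age_group": [1.0, 2.0, np.nan],
})

# True if any cell in the frame is missing
any_missing = bool(toy.isnull().values.any())

# Number of missing cells in each column
missing_per_column = toy.isnull().sum()

print(any_missing)
print(missing_per_column)
```

The per-column view shows where values are missing, which the single boolean cannot.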

In [None]:
#Explore the data
train.describe()

In [None]:
testDF.describe()

The **'age_group' column** in the train dataset is the target for the prediction model. The column is stored as a float datatype, with 4 possible classes: 1, 2, 3, 4. The dataset does not specify the definition of each group.   

The bulk of Trell users fall under age_group 1.
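One way to prepare such a target column is sketched below, assuming a frame shaped like the train dataset: cast the float labels to integers, separate the features, and look at each class's share. The toy values and the `feature_a` column are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the train dataframe (values invented for illustration)
toy_train = pd.DataFrame({
    "feature_a": [0.2, 0.4, 0.1, 0.9, 0.8, 0.5, 0.7, 0.3],
    "age_group": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 1.0],
})

# Float labels such as 1.0 become integer class labels
y = toy_train["age_group"].astype(int)

# Everything except the target is treated as a feature
X = toy_train.drop(columns=["age_group"])

# Fraction of rows in each class, ordered by class label
class_share = y.value_counts(normalize=True).sort_index()
print(class_share)
```

When one class dominates, plain accuracy can look good even if the smaller classes are predicted poorly, so per-class metrics are worth checking as well.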

In [None]:
#get to know age_group
train['age_group'].describe()

In [None]:
train['age_group'].value_counts()

In [None]:
#visualize age_group column
import matplotlib.pyplot as plt
%matplotlib inline
train.age_group.value_counts().plot(kind='bar', color='dodgerblue')

Correlation among columns:

In [None]:
correlation=train.corr()
correlation

In [None]:
import seaborn as sns
sns.heatmap(correlation, cmap="Reds")

The model is a multi-class logistic regression that uses the **One-vs-Rest** heuristic: the multi-class problem is split into one binary classification problem per class, and a binary classifier is trained on each one.
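To make the one-vs-rest idea concrete, the sketch below uses scikit-learn's `OneVsRestClassifier` wrapper on a small synthetic dataset. The dataset sizes and `max_iter=1000` are arbitrary choices for the example. It shows that one binary model is fitted per class and that the predicted class is the one whose binary model returns the highest score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Small synthetic 4-class problem (sizes chosen arbitrarily for the example)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_redundant=2, n_classes=4, random_state=0,
)

# Wrap a binary logistic regression in the one-vs-rest scheme
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_demo, y_demo)

# One fitted binary estimator per class
n_binary_models = len(ovr.estimators_)

# Each binary model scores "this class vs. the rest";
# the predicted class is the one with the highest score
scores = np.column_stack(
    [est.decision_function(X_demo) for est in ovr.estimators_]
)
manual_pred = ovr.classes_[scores.argmax(axis=1)]

print(n_binary_models)
print((manual_pred == ovr.predict(X_demo)).all())
```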



In [None]:
# logistic regression for multi-class classification using built-in one-vs-rest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# define a synthetic 4-class dataset with 26 features (generated by make_classification)
X, y = make_classification(n_samples=2000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
#accuracy from training dataset
model.score(X, y)

In [None]:
#scores the model against its own predictions (yhat), so this always returns 1.0
#it is not an accuracy on a held-out test set
model.score(X, yhat)
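Scoring a model against its own predictions yields 1.0 by construction, so it says nothing about generalization. The sketch below contrasts that with accuracy on a held-out split. It uses synthetic data, and the `stratify` and `random_state` arguments are assumptions added here for a reproducible, class-balanced split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 4-class data (sizes chosen arbitrarily for the example)
X_all, y_all = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    n_redundant=2, n_classes=4, random_state=0,
)

# Hold out 30% of the rows, keeping class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predictions compared with themselves: always 1.0
self_score = clf.score(X_te, clf.predict(X_te))

# Predictions compared with the true held-out labels
held_out_accuracy = clf.score(X_te, y_te)

print(self_score, held_out_accuracy)
```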

In [None]:
from sklearn.model_selection import train_test_split as split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

In [None]:
# evaluate the model by splitting the data-set into train and test sets
X_train, X_test, y_train, y_test = split(X, y, test_size=0.3)

model2 = LogisticRegression()
model2.fit(X_train, y_train)

In [None]:
predicted = model2.predict(X_test)
print(y_test)
predicted

In [None]:
prediction1 = pd.DataFrame(predicted)
# rename returns a new dataframe, so assign the result back
prediction1 = prediction1.rename(columns={0: 'prediction'})
prediction1

In [None]:
#a histogram of the label differences (true - predicted); 0 means a correct prediction
plt.hist(y_test - predicted, color='salmon')

In [None]:
# generate class probabilities
probs = model2.predict_proba(X_test)
probs

In [None]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))

In [None]:
conf_matrix = metrics.confusion_matrix(y_test, predicted)
sns.heatmap(conf_matrix, annot=True,cmap='Greens')

In [None]:
print(metrics.classification_report(y_test, predicted))

In [None]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
scores, scores.mean()
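Since the stated evaluation metric is the weighted F1 score, cross-validation can report it directly instead of accuracy. A minimal sketch on synthetic data follows; the explicit `StratifiedKFold` with shuffling and a fixed seed is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data (sizes chosen arbitrarily for the example)
X_cv, y_cv = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_redundant=2, n_classes=4, random_state=0,
)

# 10 folds that preserve class proportions; shuffled with a fixed seed
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# "f1_weighted" is scikit-learn's scorer name for the weighted F1 score
f1_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv,
    scoring="f1_weighted", cv=cv,
)

print(f1_scores.mean(), f1_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies between folds.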

In [None]:
##adjusted model
#lowered n_samples to 1000 and set random_state to 0

X, y = make_classification(n_samples=1000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=0)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
##the model above fits worse
#redo the adjustments
#same n_samples, random_state set to 1

X, y = make_classification(n_samples=1000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
##the one above fits better but not better than the 1st model
#redo the adjustments

#larger n_samples and random_state to 1

X, y = make_classification(n_samples=3000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
#redo the adjustments

#same n_samples and random_state to 2

X, y = make_classification(n_samples=3000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=2)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
#the model above is better than the previous ones
#still adjust to seek better scores

#same n_samples and random_state to 3

X, y = make_classification(n_samples=3000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=3)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
#The model above did not fit better
#retry with a larger sample but leave random_state at 2
X, y = make_classification(n_samples=4000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=2)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

In [None]:
model.score(X, y)

In [None]:
#the previous model specification did not improve the score
#the model works better at 2000 sample size and random_state at 2
#as below

X, y = make_classification(n_samples=2000, n_features=26, n_informative=13, n_redundant=13, n_classes=4, random_state=2)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
model.score(X, y)

In [None]:
# evaluate the model by splitting the data-set into train and test sets
#increased the test_size
X_train, X_test, y_train, y_test = split(X, y, test_size=0.4)

model3 = LogisticRegression()
model3.fit(X_train, y_train)

In [None]:
predicted = model3.predict(X_test)
print(y_test)
predicted

In [None]:
#a histogram of the label differences (true - predicted); 0 means a correct prediction
plt.hist(y_test - predicted, color='purple')

# generate class probabilities
probs = model3.predict_proba(X_test)
probs

In [None]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))

In [None]:
print(metrics.classification_report(y_test, predicted))

In [None]:
conf_matrix = metrics.confusion_matrix(y_test, predicted)
sns.heatmap(conf_matrix, annot=True,cmap='magma')

In [None]:
#increased the test_size one more time
#for the model that fit best of the above tries
X_train, X_test, y_train, y_test = split(X, y, test_size=0.5)

model4 = LogisticRegression()
model4.fit(X_train, y_train)

In [None]:
predicted = model4.predict(X_test)
print(y_test)
predicted

In [None]:
#a histogram of the label differences (true - predicted); 0 means a correct prediction
plt.hist(y_test - predicted, color='gray')

In [None]:
# generate class probabilities
probs = model4.predict_proba(X_test)
probs

In [None]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))

In [None]:
print(metrics.classification_report(y_test, predicted))

In [None]:
conf_matrix = metrics.confusion_matrix(y_test, predicted)
sns.heatmap(conf_matrix, annot=True,cmap='mako')

In [None]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
scores, scores.mean()

In [None]:
prediction2 = pd.DataFrame(predicted)
# rename returns a new dataframe, so assign the result back before saving
prediction2 = prediction2.rename(columns={0: 'prediction'})
prediction2.to_csv('predictionDF.csv', index=False)

After several adjustments to the first model, the prediction2 results fit best, with a weighted F1 score of 71%.      
Others could pursue approaches other than logistic regression or classification.