# How Many Followers Will You Get 

## Introduction

This notebook predicts the number of followers a Kaggler will get, taking kernel votes, forum message votes, performance tier and days since registration into consideration.

A linear regression model is adopted to forecast how many followers a Kaggler will get. The source data is obtained by merging useful information from meta-kaggle files about users, kernels and forums.

### Part One

Select necessary information from multiple meta-kaggle files by dropping columns and summing up some values.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt

#### 1. Get information from files on users

In [None]:
df_followers=pd.read_csv('/kaggle/input/meta-kaggle/UserFollowers.csv').drop(['Id'],axis=1)
user_followers = df_followers.groupby('UserId')['FollowingUserId'].count().sort_values(ascending = False).reset_index().round(1)
user_followers=user_followers.rename(columns={'FollowingUserId':'FollowersNum'})
user_followers.head()
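
The count-per-user step above can be checked on a tiny table. This is a minimal sketch with made-up rows standing in for `UserFollowers.csv`; the name `toy_followers` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for UserFollowers.csv (values are made up for illustration).
toy_followers = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 3, 3],
    "FollowingUserId": [10, 11, 12, 10, 11, 12],
})

# Count rows per UserId, sort by the count, and rename the count column.
toy_counts = (
    toy_followers.groupby("UserId")["FollowingUserId"]
    .count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"FollowingUserId": "FollowersNum"})
)
print(toy_counts)
```

The `.round(1)` call used in the notebook has no effect on integer counts, so it is left out of the sketch.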

In [None]:
df_users=pd.read_csv('/kaggle/input/meta-kaggle/Users.csv').drop(['DisplayName'],axis=1)
df_users = df_users.rename(columns={'Id':'UserId',})
df_users.head()

#### 2. Get information from files on kernels and forums

In [None]:
df_kernels_votes=pd.read_csv('/kaggle/input/meta-kaggle/KernelVotes.csv').drop(['VoteDate'],axis=1)
kernel_votes=df_kernels_votes.groupby('KernelVersionId')['UserId'].count().sort_values(ascending = False).reset_index().round(1)
kernel_votes=kernel_votes.rename(columns={'UserId':'KernelVotesNum'})
kernel_votes.head()

In [None]:
df_kernels=pd.read_csv('/kaggle/input/meta-kaggle/Kernels.csv').drop(['Id','ForkParentKernelVersionId','ForumTopicId','FirstKernelVersionId','CreationDate','EvaluationDate','MadePublicDate','IsProjectLanguageTemplate'],axis=1)

In [None]:
df_kernels = df_kernels.rename(columns={'CurrentKernelVersionId':'KernelVersionId'})
df_kernels=df_kernels.drop(['MedalAwardDate','CurrentUrlSlug'],axis=1)
df_kernels.head()

In [None]:
df_forums=pd.read_csv('/kaggle/input/meta-kaggle/ForumMessageVotes.csv').drop(['ForumMessageId','FromUserId','VoteDate'],axis=1)
user_forums=df_forums.groupby('ToUserId')['Id'].count().sort_values(ascending = False).reset_index().round(1)
user_forums = user_forums.rename(columns={'ToUserId':'UserId','Id':'ForumVotesNum'})
user_forums.head()

#### 3. Merge dataframes on users

In [None]:
full_data_user= df_users.merge(user_followers, on = ['UserId'],how = 'right')
full_data_user.head()

In [None]:
full_data_user=full_data_user.merge(user_forums,on=['UserId'],how='right')
full_data_user.head()

#### 4. Merge dataframes on kernels

In [None]:
full_data_kernel=kernel_votes.merge(df_kernels, on = ['KernelVersionId'],how = 'right')
full_data_kernel.head()

In [None]:
full_data_kernel=full_data_kernel.rename(columns={'AuthorUserId':'UserId'})
full_data_kernel.head()

#### 5. Merge dataframes on users and kernels

In [None]:
full_data=full_data_user.merge(full_data_kernel, on = ['UserId'],how = 'right')
full_data.head()
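
One point worth checking after this merge: the kernel table has one row per kernel, so every per-user column is repeated on each of that user's kernel rows. The sketch below uses made-up toy tables (`users`, `kernels` are hypothetical names) to show how different per-user aggregations behave on such a merged frame.

```python
import pandas as pd

# Toy per-user and per-kernel tables (values are made up for illustration).
users = pd.DataFrame({"UserId": [1, 2], "FollowersNum": [10, 4]})
kernels = pd.DataFrame({
    "UserId": [1, 1, 1, 2],
    "KernelVotesNum": [5, 7, 1, 3],
})

# One row per kernel; FollowersNum is repeated on every kernel row of a user.
merged = users.merge(kernels, on="UserId", how="right")

# Summing a repeated per-user value multiplies it by the number of kernels.
summed = merged.groupby("UserId")["FollowersNum"].sum()
# Taking the first value per user keeps the original per-user number.
first = merged.groupby("UserId")["FollowersNum"].first()
# Per-kernel values, in contrast, are meant to be summed.
kernel_votes_total = merged.groupby("UserId")["KernelVotesNum"].sum()

print(summed.to_dict(), first.to_dict(), kernel_votes_total.to_dict())
```

In this toy case user 1 has 10 followers and three kernels, so a per-user sum reports 30 while `first()` reports 10.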

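#### 6. Aggregate kernel votes, forum votes and the number of followers for each user

In [None]:
# Kernel votes differ per kernel, so they are summed per user.
full_data_1=full_data.groupby('UserId')['KernelVotesNum'].sum()
full_data_1.head()

In [None]:
# ForumVotesNum is a per-user value repeated on every kernel row after the merge,
# so take it once per user; summing would multiply it by the number of kernels.
full_data_2=full_data.groupby('UserId')['ForumVotesNum'].max().fillna(0)
full_data_2.head()

In [None]:
# FollowersNum is repeated on every kernel row in the same way.
full_data_3=full_data.groupby('UserId')['FollowersNum'].max().fillna(0)
full_data_3.head()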

In [None]:
full_data_1=full_data_1.to_frame()
# type(full_data_1)
full_data_2=full_data_2.to_frame()
full_data_3=full_data_3.to_frame()

In [None]:
result_dataset=full_data_1.merge(full_data_2,on=['UserId'],how='right')
result_dataset.head()

In [None]:
result_dataset=result_dataset.merge(full_data_3,on=['UserId'],how='right')
result_dataset.head()

#### The dataframe with necessary attributes only

In [None]:
result_dataset_1=df_users.merge(result_dataset, on = ['UserId'],how = 'right')
result_dataset_1.head()

#### 7. Drop rows with null values in the "PerformanceTier" column

In [None]:
result_dataset_1=result_dataset_1.dropna(subset=['PerformanceTier'])
result_dataset_1.head()

#### 8. Reduce dataset by filtering out the zero values in the columns of "KernelVotesNum" and "ForumVotesNum"

In [None]:
result_dataset_2=result_dataset_1.where((result_dataset_1.KernelVotesNum!=0)& (result_dataset_1.ForumVotesNum!=0))
result_dataset_2=result_dataset_2.dropna()
result_dataset_2.head()
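
The `where(...)` plus `dropna()` combination above can be compared with plain boolean indexing. This is a minimal sketch on a made-up frame (`toy` is a hypothetical name); it is not a replacement for the notebook's step, only an illustration of the difference.

```python
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "UserId": [1, 2, 3, 4],
    "KernelVotesNum": [5, 0, 2, 8],
    "ForumVotesNum": [3, 4, 0, 1],
})

keep = (toy.KernelVotesNum != 0) & (toy.ForumVotesNum != 0)

# where() blanks out non-matching rows with NaN, dropna() then removes them.
via_where = toy.where(keep).dropna()
# Boolean indexing selects the matching rows directly.
via_mask = toy[keep]

print(via_where.dtypes)
print(via_mask.dtypes)
```

Both select the same users here, but `where` followed by `dropna()` converts integer columns to floats and also removes rows that have a missing value in any other column, while boolean indexing does neither.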

#### 9. Get the attribute "DaysSinceRegistration"

In [None]:
import datetime as dt
from datetime import date, timedelta

def get_days_from_registration(row):
    '''Function returns number of days since users registration date'''
    
    today = dt.datetime.now().date()
    days = (today - dt.datetime.strptime(row['RegisterDate'], "%m/%d/%Y").date()).days
    
    return days


In [None]:
result_dataset_2['DaysSinceRegistration'] = result_dataset_2.apply(lambda row: get_days_from_registration(row),axis=1)
result_dataset_2.head()
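
The same day count can also be computed without a row-wise `apply`. The sketch below assumes the `%m/%d/%Y` date format used above and a fixed, made-up reference date (instead of today's date) so that the result is reproducible.

```python
import pandas as pd

# Toy registration dates in the same month/day/year format (made-up values).
toy = pd.DataFrame({"RegisterDate": ["01/01/2020", "12/31/2019"]})

# Fixed reference date, an assumption for this sketch; the notebook uses today's date.
reference = pd.Timestamp("2020-01-31")

# Parse the whole column at once and subtract it from the reference date.
registered = pd.to_datetime(toy["RegisterDate"], format="%m/%d/%Y")
toy["DaysSinceRegistration"] = (reference - registered).dt.days

print(toy)
```

Because the notebook measures from the current date, its `DaysSinceRegistration` values, and therefore the fitted model, change slightly every time it is rerun.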

#### The final dataset for training and evaluating the regression model

In [None]:
result_dataset_2=result_dataset_2.drop(['RegisterDate'],axis=1)
result_dataset_2.head()

### Part Two

The source data is normalized with min-max scaling, split into a training set and a testing set in a 7:3 ratio (`test_size=0.3`), and reshaped for the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
y_data=result_dataset_2[['FollowersNum']]
x_data=result_dataset_2.drop(['FollowersNum','UserName','UserId'], axis=1)

In [None]:
# DataFrame.min()/max() return one value per column, which the normalization below relies on
y_data_min=y_data.min()
y_data_max=y_data.max()
x_data_min=x_data.min()
x_data_max=x_data.max()

In [None]:
x_data_max

In [None]:
# normalization
x = (x_data -x_data_min)/(x_data_max-x_data_min)
y = (y_data -y_data_min)/(y_data_max-y_data_min)
# x= x_data
# y= y_data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
x_train.head()
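
Min-max normalization maps each column to the range [0, 1] via `(value - min) / (max - min)`, and multiplying by `(max - min)` and adding `min` undoes it. A minimal sketch on a made-up frame (`toy_x` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy feature frame (values are made up for illustration).
toy_x = pd.DataFrame({
    "KernelVotesNum": [0.0, 50.0, 100.0],
    "DaysSinceRegistration": [100.0, 300.0, 500.0],
})

# Per-column minimum and maximum.
x_min, x_max = toy_x.min(), toy_x.max()

# Forward transform: every column now spans [0, 1].
scaled = (toy_x - x_min) / (x_max - x_min)
# Inverse transform: recover the original units.
restored = scaled * (x_max - x_min) + x_min

print(scaled)
```

The same inverse transform is what turns a normalized model output back into a follower count later in the notebook.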

In [None]:
x_train = x_train.T
x_test = x_test.T
y_train = y_train.T
y_test = y_test.T

print("x train: ",x_train.shape)
print("x test: ",x_test.shape)
print("y train: ",y_train.shape)
print("y test: ",y_test.shape)

In [None]:
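# x_train and x_test were transposed above to shape (4, n); transpose back to (n, 4).
# reshape(-1,4) on the transposed array would mix values of different samples into one row.
x_train_1=x_train.values.T
y_train_1=y_train.values.reshape(-1,1)
x_test_1=x_test.values.T
y_test_1=y_test.values.reshape(-1,1)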
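
When an array has been transposed, `reshape` and a second transpose are not interchangeable: `reshape` refills rows in reading order and does not move features back into columns. The sketch below uses a made-up 3-sample, 4-feature array to show the difference.

```python
import numpy as np

# 3 samples x 4 features; each row is one sample (made-up values).
samples = np.array([
    [1, 10, 100, 1000],
    [2, 20, 200, 2000],
    [3, 30, 300, 3000],
])

transposed = samples.T                  # shape (4, 3): one row per feature
scrambled = transposed.reshape(-1, 4)   # shape (3, 4), but rows mix samples
recovered = transposed.T                # shape (3, 4), original layout

print(scrambled)
print(recovered)
```

Here the first row of `scrambled` is `[1, 2, 3, 10]`, which combines the first feature of all three samples with the second feature of the first sample, whereas `recovered` equals the original `samples`.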

### Part Three

A linear regression model is built and trained with Keras.

In [None]:
import tensorflow
import keras
import csv
import numpy
import matplotlib.pyplot as plot

In [None]:
def show_train_history(train_history, x1, x2):
    plot.plot(train_history.history[x1])
    plot.plot(train_history.history[x2])
    plot.title('Train History')
    plot.ylabel('train')
    plot.xlabel('Epoch')
    plot.legend([x1, x2], loc = 'upper right')
    plot.show()

In [None]:
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
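
The idea behind early stopping with a patience value can be sketched in plain Python. This is a simplified illustration of the concept, not Keras's actual implementation; the function name `early_stop_epoch` is hypothetical.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# The loss improves for two epochs, then stalls for two epochs.
print(early_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=2))
```

With `patience=10`, as configured above, training can end well before the 500 epochs requested in `model.fit`.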

In [None]:
#Initialize parameters of the model
numpy.random.seed(0)
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_dim = 4, activation = 'linear'))
# Set up methods of training the model
model.compile(loss='mean_squared_error', optimizer = 'sgd', metrics = ['mae'])
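
A single `Dense(1)` layer with a linear activation and a mean squared error loss is ordinary linear regression, so its optimum can also be found in closed form by least squares. The sketch below does this with NumPy on synthetic data; `true_w` and `true_b` are made-up values, not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples with 4 features in [0, 1) (made up for illustration).
X = rng.random((200, 4))
true_w = np.array([0.5, -0.2, 0.1, 0.3])  # hypothetical weights
true_b = 0.05                             # hypothetical bias
y = X @ true_w + true_b                   # noiseless targets

# Append a column of ones so the bias is estimated as a fifth coefficient.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_aug @ coef = y.
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = coef[:4], coef[4]

print(w_hat, b_hat)
```

Comparing such a closed-form fit with the weights learned by SGD is one way to check whether training has converged.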

In [None]:
model.summary()

In [None]:
train_history = model.fit(x_train_1, y_train_1, epochs = 500, batch_size = 5, validation_split=0.1,callbacks=[early_stopping])

In [None]:
show_train_history(train_history,'loss','val_loss')
show_train_history(train_history,'mae','val_mae')

#### Evaluate the trained model. 
Contrary to intuition, kernel votes and days since registration have a negative impact on the number of followers, while the other two attributes have only a small positive impact. This result indicates that there is no simple linear relationship between the four attributes and how many followers a Kaggler will get.

In [None]:
scores = model.evaluate(x_test_1,y_test_1)
# print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
print("\n%s: %.2f" % (model.metrics_names[1], scores[1]*(y_data_max-y_data_min)))
print(model.get_weights())
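
Multiplying the normalized MAE by `(max - min)` converts it back into a number of followers, because min-max scaling only shifts and rescales the target. A minimal check with made-up values:

```python
import numpy as np

# Made-up true and predicted follower counts.
y_true = np.array([0.0, 50.0, 200.0])
y_pred = np.array([10.0, 40.0, 180.0])

y_min, y_max = y_true.min(), y_true.max()


def scale(values):
    """Min-max scale values using the range of y_true."""
    return (values - y_min) / (y_max - y_min)


# MAE on the normalized scale, then rescaled to followers.
mae_norm = np.mean(np.abs(scale(y_true) - scale(y_pred)))
mae_followers = mae_norm * (y_max - y_min)

# MAE computed directly in followers, for comparison.
mae_direct = np.mean(np.abs(y_true - y_pred))

print(mae_followers, mae_direct)
```

The shift by `y_min` cancels in the difference, which is why only the range is needed when rescaling the error.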

### Part Four

Try to predict the number of followers by feeding some data into the model.

In [None]:
x_customized_data = np.array([1,100,100,365]) # data format:['PerformanceTier','KernelVotesNum','ForumVotesNum','DaysSinceRegistration'] 
x_customized_test = (x_customized_data -x_data_min)/(x_data_max-x_data_min) #Normalized inputs
x_customized_test=x_customized_test.values.reshape(-1,4) #Reshape

#### Get the forecasted number of followers from the model

In [None]:
predict_votes = model.predict(x_customized_test)
predict_votes

#### According to the prediction below, a Kaggler with a performance tier of 1, 100 kernel votes and 100 forum message votes, who has been in the Kaggle community for one year, will get around 65 followers

In [None]:
predict_votes[0][0]*(y_data_max-y_data_min)+y_data_min # Rescale models outputs into legible data
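
The whole prediction path is: normalize the inputs, apply the linear model, and rescale the output. The sketch below walks through it with NumPy. All weights, ranges and the bias are hypothetical values chosen for easy arithmetic; they are not the trained model's parameters or the dataset's real ranges.

```python
import numpy as np

# Hypothetical model parameters, for illustration only.
weights = np.array([0.1, 0.2, 0.3, 0.4])
bias = 0.0

# Hypothetical feature and target ranges, for illustration only.
x_min = np.array([0.0, 0.0, 0.0, 0.0])
x_max = np.array([5.0, 1000.0, 1000.0, 3650.0])
y_min, y_max = 0.0, 1000.0

# ['PerformanceTier', 'KernelVotesNum', 'ForumVotesNum', 'DaysSinceRegistration']
x_raw = np.array([1.0, 100.0, 100.0, 365.0])

x_norm = (x_raw - x_min) / (x_max - x_min)    # [0.2, 0.1, 0.1, 0.1]
y_norm = x_norm @ weights + bias              # linear model on normalized inputs
followers = y_norm * (y_max - y_min) + y_min  # back to a follower count

print(followers)
```

With these made-up numbers the normalized output is 0.11, which rescales to 110 followers.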

#### A Kaggler with a performance tier of 5, 5000 kernel votes and 5000 forum message votes, who has been in the community for 730 days, is predicted to get 168 followers.

In [None]:
x_customized_data = np.array([5,5000,5000,730]) # data format:['PerformanceTier','KernelVotesNum','ForumVotesNum','DaysSinceRegistration'] 
x_customized_test = (x_customized_data -x_data_min)/(x_data_max-x_data_min) #Normalized inputs
x_customized_test=x_customized_test.values.reshape(-1,4) #Reshape
predict_votes = model.predict(x_customized_test)
# predict_votes
predict_votes[0][0]*(y_data_max-y_data_min)+y_data_min # Rescale models outputs into legible data

## Conclusion

The experiment gives a rough picture of how these features affect the number of followers. However, the model adopted here is only a simple linear regression, so the relationship between these attributes and the number of followers cannot be captured well. Models with more complex structures and more parameters need to be studied on this topic.