# Social Media Homework
## Part 1

In [17]:
import numpy as np
import pandas as pd
import copy

In [18]:
train_raw = pd.read_csv('train.csv')

In [19]:
train_raw.columns

Index(['Choice', 'A_follower_count', 'A_following_count', 'A_listed_count',
       'A_mentions_received', 'A_retweets_received', 'A_mentions_sent',
       'A_retweets_sent', 'A_posts', 'A_network_feature_1',
       'A_network_feature_2', 'A_network_feature_3', 'B_follower_count',
       'B_following_count', 'B_listed_count', 'B_mentions_received',
       'B_retweets_received', 'B_mentions_sent', 'B_retweets_sent', 'B_posts',
       'B_network_feature_1', 'B_network_feature_2', 'B_network_feature_3'],
      dtype='object')

In [20]:
train_raw['Choice'].mean()

0.5094545454545455

In [21]:
train_df = copy.deepcopy(train_raw[['Choice']])

To reduce the number of columns, we first replaced each pair of A and B features with their difference (A's value minus B's value).

In [22]:
train_df['follower_count'] = train_raw['A_follower_count'] - train_raw['B_follower_count']
train_df['following_count'] = train_raw['A_following_count'] - train_raw['B_following_count']
train_df['listed_count'] = train_raw['A_listed_count'] - train_raw['B_listed_count']
train_df['mentions_received'] = train_raw['A_mentions_received'] - train_raw['B_mentions_received']
train_df['retweets_received'] = train_raw['A_retweets_received'] - train_raw['B_retweets_received']
train_df['mentions_sent'] = train_raw['A_mentions_sent'] - train_raw['B_mentions_sent']
train_df['retweets_sent'] = train_raw['A_retweets_sent'] - train_raw['B_retweets_sent']
train_df['posts'] = train_raw['A_posts'] - train_raw['B_posts']
train_df['network_feature_1'] = train_raw['A_network_feature_1'] - train_raw['B_network_feature_1']
train_df['network_feature_2'] = train_raw['A_network_feature_2'] - train_raw['B_network_feature_2']
train_df['network_feature_3'] = train_raw['A_network_feature_3'] - train_raw['B_network_feature_3']
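The eleven assignments above all follow one pattern, so they can also be written as a loop. The following is a minimal sketch, assuming every feature appears as an `A_`/`B_` pair with the same suffix; the helper name `pairwise_differences` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def pairwise_differences(raw: pd.DataFrame, target: str = "Choice") -> pd.DataFrame:
    """Return the target column plus one A-minus-B column per paired feature."""
    out = raw[[target]].copy()
    # Feature names are the column names with the "A_" prefix removed.
    features = [col[2:] for col in raw.columns if col.startswith("A_")]
    for name in features:
        out[name] = raw[f"A_{name}"] - raw[f"B_{name}"]
    return out

# Toy data standing in for train.csv (hypothetical values).
toy = pd.DataFrame({"Choice": [1, 0], "A_posts": [5.0, 1.0], "B_posts": [2.0, 4.0]})
diff = pairwise_differences(toy)
print(diff)
```

Applied to `train_raw`, this should produce the same columns as the cell above, in the same order.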

In [23]:
train_df.corr().unstack().sort_values(ascending=False).head(20)

network_feature_3  network_feature_3    1.000000
network_feature_2  network_feature_2    1.000000
follower_count     follower_count       1.000000
following_count    following_count      1.000000
listed_count       listed_count         1.000000
mentions_received  mentions_received    1.000000
retweets_received  retweets_received    1.000000
mentions_sent      mentions_sent        1.000000
retweets_sent      retweets_sent        1.000000
posts              posts                1.000000
network_feature_1  network_feature_1    1.000000
Choice             Choice               1.000000
retweets_received  mentions_received    0.988363
mentions_received  retweets_received    0.988363
retweets_received  network_feature_1    0.920574
network_feature_1  retweets_received    0.920574
                   mentions_received    0.914479
mentions_received  network_feature_1    0.914479
listed_count       follower_count       0.781208
follower_count     listed_count         0.781208
dtype: float64

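Next, we removed mentions_received and network_feature_1 (degree) because both of them were highly correlated with retweets_received (correlations of 0.99 and 0.92, respectively). We then split our data into a training set (75%) and a test set (25%) and min-max scaled every feature, fitting the scaler on the training set only so that the training features lie between 0 and 1.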
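The unstacked correlation listing above includes every self-correlation and shows each pair twice. One possible way to list only the distinct, strongly correlated pairs is sketched below; the function name `strong_pairs`, the 0.9 threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List each distinct column pair whose |correlation| meets the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle: drops the diagonal and mirrored pairs.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy data: b tracks a closely, c does not.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
result = strong_pairs(toy)
print(result)
```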

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# drop columns
train_df = train_df.drop(columns=['mentions_received', 'network_feature_1'])
y = train_df['Choice']
X = train_df.drop(columns=['Choice'])

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 110)

# scale the data
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
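Because the scaler is fit on the training set only, test values that fall outside the training range are mapped outside [0, 1]. The sketch below reproduces the min-max formula, (x - min) / (max - min), in plain NumPy on made-up numbers to show this; it is an illustration of the formula, not a replacement for `MinMaxScaler`.

```python
import numpy as np

# Hypothetical single-feature data.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# Statistics come from the training set only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range  # 12.0 exceeds the training maximum

print(train_scaled.ravel())
print(test_scaled.ravel())
```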

Next, we trained a logistic regression model on the training set and tested it on the remaining data. The results as well as the confusion matrices are below.

In [28]:
import statsmodels.api as sm
logit = sm.Logit(y_train, X_train_scaled)
result = logit.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.610310
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                 Choice   No. Observations:                 4125
Model:                          Logit   Df Residuals:                     4116
Method:                           MLE   Df Model:                            8
Date:                Tue, 11 Feb 2020   Pseudo R-squ.:                  0.1192
Time:                        13:09:50   Log-Likelihood:                -2517.5
converged:                       True   LL-Null:                       -2858.3
Covariance Type:            nonrobust   LLR p-value:                6.909e-142
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            11.0879      2.221      4.992      0.000       6.734      15.441
x2            -0.6939      1.

In [29]:
from sklearn.metrics import confusion_matrix

def make_predictions(s):
    if s >= 0.5:
        return 1
    return 0

train_predictions = pd.Series(result.predict(X_train_scaled))
train_predictions = train_predictions.map(make_predictions)
train = confusion_matrix(y_train, train_predictions)

test_predictions = pd.Series(result.predict(X_test_scaled))
test_predictions = test_predictions.map(make_predictions)
test = confusion_matrix(y_test, test_predictions)

print("Training set confusion matrix:\n", train)
print("\nTest set confusion matrix:\n", test)

print("\nTraining set accuracy:", (train[0][0] + train[1][1])/(sum(train[0])+ sum(train[1])))
print("Test set accuracy:", (test[0][0] + test[1][1])/(sum(test[0])+ sum(test[1])))

Training set confusion matrix:
 [[1319  699]
 [ 488 1619]]

Test set confusion matrix:
 [[451 229]
 [149 546]]

Training set accuracy: 0.7122424242424242
Test set accuracy: 0.7250909090909091
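As a check on the figures above, accuracy is the sum of the confusion matrix diagonal divided by the total count. The sketch below recomputes it from the reported matrices and adds a majority-class baseline for comparison; the helper names are illustrative.

```python
import numpy as np

# Confusion matrices as reported above (rows: actual, columns: predicted).
train_cm = np.array([[1319, 699], [488, 1619]])
test_cm = np.array([[451, 229], [149, 546]])

def accuracy(cm: np.ndarray) -> float:
    """Correct predictions (the diagonal) over all predictions."""
    return np.trace(cm) / cm.sum()

def majority_baseline(cm: np.ndarray) -> float:
    """Accuracy of always predicting the more common actual class."""
    return cm.sum(axis=1).max() / cm.sum()

train_acc = accuracy(train_cm)
test_acc = accuracy(test_cm)
test_base = majority_baseline(test_cm)
print(train_acc, test_acc, test_base)
```

On the test set, always guessing the more common class would be right for 695 of 1375 rows, so the model's 72.5% accuracy is well above that baseline.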


We then sorted the variables by the magnitude of their coefficients.

In [30]:
out = pd.DataFrame({'coefficients': result.params.values, 'p-values': result.pvalues.values}, index = X_train.columns)
out['magnitude'] = out['coefficients'].abs()
out.sort_values(by = ['magnitude'], ascending = False, inplace = True)
out

                   coefficients      p-values  magnitude
retweets_received    -75.384021  8.481768e-74  75.384021
listed_count          58.128448  1.944130e-33  58.128448
follower_count        11.087859  5.983042e-07  11.087859
network_feature_3      2.858370  1.147292e-04   2.858370
retweets_sent          1.913395  9.614698e-04   1.913395
mentions_sent          1.754485  1.871340e-03   1.754485
posts                  1.206705  5.602249e-02   1.206705
following_count       -0.693924  4.940719e-01   0.693924
network_feature_2      0.543859  4.967619e-01   0.543859


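Next, we took the four variables with the largest coefficient magnitudes and divided each coefficient by the absolute value of their sum. The resulting signed weights sum to -1, because the negative retweets_received coefficient outweighs the positive ones. These are the variables and weights we use in part 2.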

In [31]:
out['coefficients'].iloc[:4] / abs(out['coefficients'].iloc[:4].sum())

retweets_received   -22.779148
listed_count         17.564949
follower_count        3.350471
network_feature_3     0.863727
Name: coefficients, dtype: float64
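The cell above divides by the absolute value of the signed sum, which is why the individual weights are much larger than 1 in magnitude. For comparison, the sketch below recomputes those weights from the reported coefficients and also shows an alternative normalization (dividing by the sum of absolute values) whose magnitudes sum to 1. The alternative is an illustration only and is not the weighting used in part 2.

```python
import pandas as pd

# Top four coefficients as reported in the table above.
coefs = pd.Series({
    "retweets_received": -75.384021,
    "listed_count": 58.128448,
    "follower_count": 11.087859,
    "network_feature_3": 2.858370,
})

# Normalization used in the notebook: divide by |sum of signed coefficients|.
weights = coefs / abs(coefs.sum())

# Alternative (assumption, for comparison): divide by the sum of magnitudes.
alt_weights = coefs / coefs.abs().sum()

print(weights)
print(alt_weights)
```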

From our model, the best predictors of influence are retweets_received, listed_count, and follower_count. Surprisingly, the coefficient on retweets_received is negative, meaning that receiving more retweets is associated with a lower influencer status. On the other hand, the coefficients on listed_count and follower_count are both positive, which makes sense. A business could use these results by focusing primarily on how many followers its marketers have to identify who will be more influential.

### Financial Value Calculations

**Expected return without analytics:**

0.01% * $10/follower - 2 * $5 = $0.001/follower - $10

**Expected return with our analytics:**

72.5% * 0.015% * $10/follower - $10 = $0.0010875/follower - $10

**Increase over no analytics: $0.0000875/follower**

**Perfect analytical model:**

100% * 0.015% * $10/follower - $10 = $0.0015/follower - $10

**Increase over no analytics: $0.0005/follower**

A perfect model adds 0.05 cents per follower over doing no analytics, whereas our model adds 0.00875 cents per follower. Improving the model's accuracy could therefore further increase its financial value.
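The comparison can also be expressed as ratios that do not depend on the unit of the per-follower figures. The sketch below is one way to do that, reading 0.01% and 0.015% as the fractions 0.0001 and 0.00015 and using the 72.5% test accuracy; the function name `margin_per_follower` is hypothetical. The fixed $10 payment is the same in all three scenarios, so it cancels out of the differences.

```python
def margin_per_follower(accuracy: float, buy_prob: float, margin: float = 10.0) -> float:
    """Expected margin per follower, before subtracting the fixed payment."""
    return accuracy * buy_prob * margin

P_ONCE = 0.0001    # 0.01% purchase probability (no analytics)
P_TWICE = 0.00015  # 0.015% purchase probability (with analytics)

baseline = margin_per_follower(1.0, P_ONCE)
ours = margin_per_follower(0.725, P_TWICE)
perfect = margin_per_follower(1.0, P_TWICE)

# Share of the perfect model's gain that our model captures.
share_of_perfect_gain = (ours - baseline) / (perfect - baseline)

# Accuracy at which analytics merely matches the no-analytics return.
break_even_accuracy = P_ONCE / P_TWICE

print(share_of_perfect_gain, break_even_accuracy)
```

Under these assumptions, the model captures about 17.5% of the gain a perfect model would deliver, and any accuracy above roughly 66.7% beats doing no analytics.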