### Model

* Decide which model to use.

Modeling package: https://www.statsmodels.org/stable/index.html

* Formula documentation: https://www.statsmodels.org/dev/example_formulas.html

* Fixed effects
    * Dummy variable on ``user_id`` to absorb the per-user effect.
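As a rough illustration of the user dummy, the sketch below shows how patsy expands ``C(user_id)`` into indicator columns. The toy DataFrame and its values are made up for the example; only the ``C(user_id)`` term comes from these notes.

```python
import pandas as pd
from patsy import dmatrix

# Toy data: three hypothetical users (values invented for illustration)
toy = pd.DataFrame({"user_id": [10, 10, 20, 30, 30]})

# C(...) marks the column as categorical. With an intercept in the model,
# patsy's default treatment coding drops the first level as the reference.
X_fe = dmatrix("C(user_id)", toy, return_type="dataframe")
fe_columns = list(X_fe.columns)
print(fe_columns)
```

With many users this produces one column per user (minus the reference), which is why the design matrix and the VIF computation can become large.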

* Control variables
    * The papers try adding control variables for the tweets themselves and for the users.

* Interaction effect / cross-effect
    * Independent variables that affect one another.
    * In this case we could consider the emotions (for example, a summary of positive or negative emotions) or the other engagement metrics.
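A minimal sketch of how an interaction term is written in a patsy formula: ``a:b`` adds the product of the two variables as an extra column. The column names ``reply_count`` and ``support`` mirror the ones used later in this notebook, but the values here are invented.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data (made-up values)
toy = pd.DataFrame({
    "reply_count": [0.0, 1.0, 2.0, 3.0],
    "support": [0.5, 0.2, 0.0, 1.0],
})

# Main effect of support plus its interaction with reply_count
X_int = dmatrix("support + reply_count:support", toy, return_type="dataframe")

# For two numeric variables the interaction column is their element-wise product
interaction = X_int.iloc[:, -1].to_numpy()
expected = (toy["reply_count"] * toy["support"]).to_numpy()
print(X_int.columns.tolist())
print(np.allclose(interaction, expected))
```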

* Multicollinearity
    * ``variance_inflation_factor``. It is computed separately for each variable.
    * ``variance_inflation_factor`` expects a constant in the matrix of explanatory variables. One can use ``add_constant`` from statsmodels to add the required constant to the dataframe before passing its values to the function.
    * If we build the matrix with ``dmatrices``, the constant (intercept) is added automatically.

* Mean centering
    * It is supposed to help, presumably with collinearity between main effects and interaction terms. It is applied to numeric/continuous variables.
    * Alternative to the log transform.
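One way to see why centering may help, under the assumption that the concern is collinearity between a variable and its interaction term: on synthetic positive-valued data, the raw variable is strongly correlated with its product with another variable, while the centered versions are not. The variables ``x`` and ``z`` are hypothetical stand-ins for continuous controls.

```python
import numpy as np

def mean_centered(x):
    return x - np.mean(x)

rng = np.random.default_rng(1)
x = rng.uniform(10, 20, size=1000)  # hypothetical continuous control
z = rng.uniform(10, 20, size=1000)  # hypothetical second variable

# Correlation between a main effect and the interaction term, raw vs centered
corr_raw = np.corrcoef(x, x * z)[0, 1]
xc, zc = mean_centered(x), mean_centered(z)
corr_centered = np.corrcoef(xc, xc * zc)[0, 1]

print(round(corr_raw, 2), round(corr_centered, 2))
```

Unlike the log, centering only shifts the variable: its spread and the shape of its distribution are unchanged.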

* Overdispersion
    * ``cov_type`` can be ``nonrobust``, ``HC0``, ``HC1``, ``HC2``, ``HC3`` or ``HAC``.
    * Standard robust sandwich covariances are available through the ``cov_type`` option of ``fit``: heteroscedasticity-robust (HC), cluster-robust, heteroscedasticity- and autocorrelation-robust (HAC), and two panel-robust covariance estimators.
    * ``nonrobust`` assumes there is no overdispersion and therefore underestimates the standard errors when overdispersion is present.
    * ``HC0`` is supposed to correct for overdispersion and heteroscedasticity.
    * When the response variable is a count but the mean μ does not equal the variance σ², the Poisson distribution is not applicable. Overdispersion can be detected by dividing the residual deviance by the residual degrees of freedom. If this quotient is much greater than one, the negative binomial distribution should be used. There is no hard cutoff for "much larger than one", but a rule of thumb is that 1.10 or greater is considered large.
    * In statsmodels, the Pearson-based version of this ratio is ``model.pearson_chi2 / model.df_resid``.

In [None]:
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf 

In [None]:
# load every DataFrame of interest for a given name combination

def load_dfs(dir_data, name_part):
    df_dict = {}

#     df_dict['df_all'] = pd.read_pickle(dir_data + 'df_merged__all__' + name_part + '.pickle').sort_values(by='created_at')
    df_dict['df_no_covid_2019'] = pd.read_pickle(dir_data + 'df_merged__no_covid_2019__' + name_part + '.pickle').sort_values(by='created_at')
    df_dict['df_no_covid'] = pd.read_pickle(dir_data + 'df_merged__no_covid__' + name_part + '.pickle').sort_values(by='created_at')
    df_dict['df_pre_covid'] = pd.read_pickle(dir_data + 'df_merged__pre_covid__' + name_part + '.pickle').sort_values(by='created_at')
    df_dict['df_during_covid'] = pd.read_pickle(dir_data + 'df_merged__during_covid__' + name_part + '.pickle').sort_values(by='created_at')
    df_dict['df_post_covid'] = pd.read_pickle(dir_data + 'df_merged__post_covid__' + name_part + '.pickle').sort_values(by='created_at')

    return df_dict

dir_data = './df_merged/'
name_part = 'normalized_perc_85__tweet_None__df_tweets_social_dimensions_model_simplified' # shared suffix, to make building the file names easier
# name_part = 'normalized_perc_85_word_length__tweet_None__df_tweets_social_dimensions_model_simplified'
# name_part = 'tweet_None__df_tweets_social_dimensions_model_simplified'

df_dict = load_dfs(dir_data, name_part)



In [None]:
import numpy as np

def mean_centered(x):
    return x - np.mean(x)
    
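def get_formulas(dependent='favorite_count', cross_effects=('reply_count', 'retweet_count')):
    """Build a dict of patsy formulas, from the user dummy alone up to the interaction models."""

    # Every list starts with '' so that ' + '.join(...) returns a string beginning with ' + ',
    # which can be appended directly to an existing formula.
    ten_dims = ['','support', 'knowledge', 'conflict', 'power', 'similarity', 'fun', 'status', 'trust', 'identity', 'romance']
    ten_dims_ = ['','support', 'knowledge', 'conflict', 'power', 'similarity', 'fun', 'status', 'trust', 'identity']  # ten_dims without 'romance'

    tweet_control_log = ['','np.log(word_length)']
    tweet_control_expand_ = ['','np.log(word_length)', 'hashtag_count', 'mentions_count', 'url_count']  # raw counts; not used below
    # NOTE: np.log(count) is -inf when the count is 0; np.log1p(count) is a common alternative.
    tweet_control_expand = ['','np.log(word_length)', 'np.log(hashtag_count)', 'np.log(mentions_count)', 'np.log(url_count)']
    user_control_log = ['','np.log(followees)', 'np.log(followers)']
    user_control = ['','followees', 'followers', 'tweets_before']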
    
    user_control_mean = ['','mean_centered(followees)', 'mean_centered(followers)', 'mean_centered(tweets_before)']
    
    tweet_control_expand2 = tweet_control_expand + ['mean_centered(tweets_before)']
    
    dict_formulas = {}

    predictor = dependent + ' ~'

    dict_formulas['formula_model1'] = predictor + " C(user_id)" # user dummy (fixed effect)

    dict_formulas['formula_model2'] = dict_formulas['formula_model1'] + ' + '.join(tweet_control_log) # user dummy + log tweet controls
    dict_formulas['formula_model2a'] = dict_formulas['formula_model1'] + ' + '.join(tweet_control_expand) # user dummy + expanded tweet controls
    dict_formulas['formula_model2b'] = predictor + ' + '.join(tweet_control_log) # log tweet controls only
    dict_formulas['formula_model2c'] = predictor + ' + '.join(tweet_control_expand) # expanded tweet controls only

    dict_formulas['formula_model2d'] = dict_formulas['formula_model1'] + ' + '.join(tweet_control_expand2)
    dict_formulas['formula_model2e'] = predictor + ' + '.join(tweet_control_expand2)
    
    
    dict_formulas['formula_model3'] = dict_formulas['formula_model2'] + ' + '.join(user_control_log)
    dict_formulas['formula_model3a'] = dict_formulas['formula_model2'] + ' + '.join(user_control_mean)
    dict_formulas['formula_model3b'] = dict_formulas['formula_model1'] + ' + '.join(user_control_log)
    dict_formulas['formula_model3c'] = dict_formulas['formula_model1'] + ' + '.join(user_control_mean)
    dict_formulas['formula_model3d'] = dict_formulas['formula_model2b'] + ' + '.join(user_control_log)
    dict_formulas['formula_model3e'] = dict_formulas['formula_model2c'] + ' + '.join(user_control_mean)

    dict_formulas['formula_model3f'] = dict_formulas['formula_model2'] + ' + '.join(user_control)
    dict_formulas['formula_model3g'] = dict_formulas['formula_model1'] + ' + '.join(user_control)
    dict_formulas['formula_model3h'] = dict_formulas['formula_model2b'] + ' + '.join(user_control)
    dict_formulas['formula_model3i'] = dict_formulas['formula_model2c'] + ' + '.join(user_control)
    
    dict_formulas['formula_model3j'] = predictor + ' + '.join(user_control_log)
    dict_formulas['formula_model3k'] = predictor + ' + '.join(user_control_mean)
    dict_formulas['formula_model3l'] = predictor + ' + '.join(user_control)

    dict_formulas['formula_model4'] = dict_formulas['formula_model3'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4a'] = dict_formulas['formula_model1'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4b'] = dict_formulas['formula_model2'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4c'] = dict_formulas['formula_model2a'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4d'] = dict_formulas['formula_model3a'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4e'] = dict_formulas['formula_model3b'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4f'] = dict_formulas['formula_model3c'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4g'] = dict_formulas['formula_model3d'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4h'] = dict_formulas['formula_model3e'] + ' + '.join(ten_dims)

    dict_formulas['formula_model4i'] = dict_formulas['formula_model3f'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4j'] = dict_formulas['formula_model3g'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4k'] = dict_formulas['formula_model3h'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4l'] = dict_formulas['formula_model3i'] + ' + '.join(ten_dims)
    
    dict_formulas['formula_model4m'] = dict_formulas['formula_model2c'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4n'] = dict_formulas['formula_model2b'] + ' + '.join(ten_dims)
    
    dict_formulas['formula_model4o'] = dict_formulas['formula_model3j'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4p'] = dict_formulas['formula_model3k'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4q'] = dict_formulas['formula_model3l'] + ' + '.join(ten_dims)
    
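    # NOTE: these two lines reuse the keys 4m and 4n and overwrite the definitions above,
    # so models 5m/5n, 6m/6n and 7m/7n are built on the 2d/2e versions.
    dict_formulas['formula_model4m'] = dict_formulas['formula_model2d'] + ' + '.join(ten_dims)
    dict_formulas['formula_model4n'] = dict_formulas['formula_model2e'] + ' + '.join(ten_dims)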
    
    cc = ' + '+ cross_effects[0] + ' : '
    dict_formulas['formula_model5'] = dict_formulas['formula_model4'] + cc.join(ten_dims) 
    dict_formulas['formula_model5a'] = dict_formulas['formula_model4a'] + cc.join(ten_dims) 
    dict_formulas['formula_model5b'] = dict_formulas['formula_model4b'] + cc.join(ten_dims) 
    dict_formulas['formula_model5c'] = dict_formulas['formula_model4c'] + cc.join(ten_dims) 
    dict_formulas['formula_model5d'] = dict_formulas['formula_model4d'] + cc.join(ten_dims) 
    dict_formulas['formula_model5e'] = dict_formulas['formula_model4e'] + cc.join(ten_dims) 
    dict_formulas['formula_model5f'] = dict_formulas['formula_model4f'] + cc.join(ten_dims) 
    dict_formulas['formula_model5g'] = dict_formulas['formula_model4g'] + cc.join(ten_dims) 
    dict_formulas['formula_model5h'] = dict_formulas['formula_model4h'] + cc.join(ten_dims) 

    dict_formulas['formula_model5i'] = dict_formulas['formula_model4i'] + cc.join(ten_dims) 
    dict_formulas['formula_model5j'] = dict_formulas['formula_model4j'] + cc.join(ten_dims) 
    dict_formulas['formula_model5k'] = dict_formulas['formula_model4k'] + cc.join(ten_dims) 
    dict_formulas['formula_model5l'] = dict_formulas['formula_model4l'] + cc.join(ten_dims) 
    
    dict_formulas['formula_model5m'] = dict_formulas['formula_model4m'] + cc.join(ten_dims) 
    dict_formulas['formula_model5n'] = dict_formulas['formula_model4n'] + cc.join(ten_dims) 
    
    cc = ' + '+ cross_effects[1] + ' : '
    dict_formulas['formula_model6'] = dict_formulas['formula_model4'] + cc.join(ten_dims) 
    dict_formulas['formula_model6a'] = dict_formulas['formula_model4a'] + cc.join(ten_dims) 
    dict_formulas['formula_model6b'] = dict_formulas['formula_model4b'] + cc.join(ten_dims) 
    dict_formulas['formula_model6c'] = dict_formulas['formula_model4c'] + cc.join(ten_dims) 
    dict_formulas['formula_model6d'] = dict_formulas['formula_model4d'] + cc.join(ten_dims) 
    dict_formulas['formula_model6e'] = dict_formulas['formula_model4e'] + cc.join(ten_dims) 
    dict_formulas['formula_model6f'] = dict_formulas['formula_model4f'] + cc.join(ten_dims) 
    dict_formulas['formula_model6g'] = dict_formulas['formula_model4g'] + cc.join(ten_dims) 
    dict_formulas['formula_model6h'] = dict_formulas['formula_model4h'] + cc.join(ten_dims) 

    dict_formulas['formula_model6i'] = dict_formulas['formula_model4i'] + cc.join(ten_dims) 
    dict_formulas['formula_model6j'] = dict_formulas['formula_model4j'] + cc.join(ten_dims) 
    dict_formulas['formula_model6k'] = dict_formulas['formula_model4k'] + cc.join(ten_dims) 
    dict_formulas['formula_model6l'] = dict_formulas['formula_model4l'] + cc.join(ten_dims) 

    dict_formulas['formula_model6m'] = dict_formulas['formula_model4m'] + cc.join(ten_dims) 
    dict_formulas['formula_model6n'] = dict_formulas['formula_model4n'] + cc.join(ten_dims) 
    
    dict_formulas['formula_model7'] = dict_formulas['formula_model4'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7a'] = dict_formulas['formula_model4a'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7b'] = dict_formulas['formula_model4b'] + cc.join(ten_dims_)  
    dict_formulas['formula_model7c'] = dict_formulas['formula_model4c'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7d'] = dict_formulas['formula_model4d'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7e'] = dict_formulas['formula_model4e'] + cc.join(ten_dims_)  
    dict_formulas['formula_model7f'] = dict_formulas['formula_model4f'] + cc.join(ten_dims_)  
    dict_formulas['formula_model7g'] = dict_formulas['formula_model4g'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7h'] = dict_formulas['formula_model4h'] + cc.join(ten_dims_) 

    dict_formulas['formula_model7i'] = dict_formulas['formula_model4i'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7j'] = dict_formulas['formula_model4j'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7k'] = dict_formulas['formula_model4k'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7l'] = dict_formulas['formula_model4l'] + cc.join(ten_dims_) 
    
    dict_formulas['formula_model7m'] = dict_formulas['formula_model4m'] + cc.join(ten_dims_) 
    dict_formulas['formula_model7n'] = dict_formulas['formula_model4n'] + cc.join(ten_dims_) 
    
    dict_formulas['formula_model8'] = dict_formulas['formula_model2'] + '+ retweet_count'
    dict_formulas['formula_model8a'] = dict_formulas['formula_model2a'] + '+ reply_count'
    dict_formulas['formula_model8b'] = dict_formulas['formula_model2'] + '+ retweet_count + reply_count'
    dict_formulas['formula_model8c'] = dict_formulas['formula_model2a'] + '+ retweet_count + reply_count'
    
    dict_formulas['formula_model8d'] = dict_formulas['formula_model8'] + ' + '.join(ten_dims)
    dict_formulas['formula_model8e'] = dict_formulas['formula_model8a'] + ' + '.join(ten_dims)
    dict_formulas['formula_model8f'] = dict_formulas['formula_model8b'] + ' + '.join(ten_dims)
    dict_formulas['formula_model8g'] = dict_formulas['formula_model8c'] + ' + '.join(ten_dims)
    
    cc = ' + '+ cross_effects[0] + ' : '
    dict_formulas['formula_model8h'] = dict_formulas['formula_model8d'] + cc.join(ten_dims) 
    dict_formulas['formula_model8i'] = dict_formulas['formula_model8e'] + cc.join(ten_dims) 
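    # NOTE: 8j and 8k append the main effects again instead of cc.join(ten_dims),
    # so no interaction terms are added here; confirm whether that is intended.
    dict_formulas['formula_model8j'] = dict_formulas['formula_model8f'] + ' + '.join(ten_dims)
    dict_formulas['formula_model8k'] = dict_formulas['formula_model8g'] + ' + '.join(ten_dims)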
    
    cc = ' + '+ cross_effects[1] + ' : '
    dict_formulas['formula_model8l'] = dict_formulas['formula_model8h'] + cc.join(ten_dims) 
    dict_formulas['formula_model8m'] = dict_formulas['formula_model8i'] + cc.join(ten_dims) 
    dict_formulas['formula_model8n'] = dict_formulas['formula_model8j'] + cc.join(ten_dims)
    # NOTE: 8o appends the main effects again (no interaction terms), unlike 8l-8n.
    dict_formulas['formula_model8o'] = dict_formulas['formula_model8k'] + ' + '.join(ten_dims)
    
    return dict_formulas

In [None]:
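# 'df_all' is commented out in load_dfs, so pick one of the frames that is actually loaded
df_dims = df_dict['df_no_covid']

# build the formulas with the defaults (favorite_count as the dependent variable)
dict_formulas = get_formulas()

model = smf.glm(formula=dict_formulas['formula_model2'], data=df_dims, family=sm.families.NegativeBinomial()).fit(cov_type='HC0')

model.summary()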

In [None]:
model.pearson_chi2 / model.df_resid # overdispersion check: values well above 1 suggest overdispersion

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from tqdm.notebook import tqdm

def calc_vif(X):
    """Return a DataFrame with the VIF of every column of the design matrix X."""
    values = X.values  # convert once instead of on every iteration
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    # tqdm shows progress; with many C(user_id) dummies this loop can be slow
    vif["VIF"] = [variance_inflation_factor(values, i) for i in tqdm(range(X.shape[1]))]

    return vif

In [None]:
import pickle
import os

dep = 'favorite_count'
cross = ['reply_count','retweet_count']

# dep = 'retweet_count'
# cross = ['reply_count','favorite_count']

dict_formulas = get_formulas(dep, cross)

for ww in df_dict:
    print('--------------------',ww)
    df_dims = df_dict[ww]
    fn = '__ivf__' + name_part + '__' + ww + '.pickle'
    print(fn)
    dict_ivf = {}
        
    if os.path.exists(fn):
        with open(fn,'rb') as file:
            dict_ivf = pickle.load(file)
    print(len(dict_ivf))
    processed = set(k.split('~')[1] for k in dict_ivf)
    for formula,form in dict_formulas.items():

#         if form in dict_ivf:
#             continue
        
        if form.split('~')[1] in processed:
            continue
    
        print(formula,'::',form)
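        y, X = dmatrices(form, df_dims, return_type='dataframe') # this adds the constant (Intercept column)
        df_vif = calc_vif(X)
        
        dict_ivf[form] = df_vif
        processed.add(form.split('~')[1]) # the VIF depends only on the right-hand side, so skip repeats
        with open(fn,'wb') as file:
            pickle.dump(dict_ivf,file)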

In [None]:
# correlation between the features, to inspect collinearity
import matplotlib.pyplot as plt
import seaborn as sns

# plt.figure(figsize=(10,7))

# compute the correlation matrix once; numeric_only skips non-numeric columns (pandas >= 1.5)
corr = df_dims.corr(numeric_only=True)

# Generate a mask to show only the bottom triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# generate heatmap
sns.heatmap(corr, annot=True, mask=mask, vmin=-1, vmax=1)
plt.title('Correlation Coefficient Of Predictors')
plt.show()