So, is there really a difference between the work of highly-ranked and lower-ranked tech bloggers? Let's find out.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
authors = pd.read_csv('../input/identifying-influential-bloggers-techcrunch/authors.csv', header = None)
posts = pd.read_csv('../input/identifying-influential-bloggers-techcrunch/posts.csv', header = None)

I want to see if I can predict if a blogger has a good or bad rating solely from the text of the blog posts. First, I'm going to look at the data and build a dataframe specific to the model I'm trying to build.

In [None]:
authors.head()

In [None]:
authors

In [None]:
posts.head()

In [None]:
author_score = dict(zip(authors.iloc[:,1], authors.iloc[:,2]))

In [None]:
data = pd.DataFrame()
data['author'] = posts.iloc[:,2]
data['score'] = data['author'].map(author_score)
data['posts'] = posts.iloc[:,5]

In [None]:
data.head()

Now, I'm going to clean the text of the posts.

In [None]:
import string

def clean_text(text):
    # Lowercase, strip ASCII punctuation, and collapse repeated whitespace
    text = str(text).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return " ".join(text.split())

data['posts'] = data['posts'].apply(clean_text)

In [None]:
data.head()
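
To make the cleaning step concrete, here is a small standalone sketch of the same idea (lowercase, strip punctuation, collapse whitespace) applied to a made-up sentence. The function name `normalize_post` and the sample text are my own illustrative assumptions, not part of the dataset.

```python
import string

def normalize_post(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    lowered = str(text).lower()
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    return " ".join(no_punct.split())

sample = "Hello, World!  It's 2010."
cleaned = normalize_post(sample)
print(cleaned)  # hello world its 2010
```

Note that `string.punctuation` only covers ASCII characters, so curly quotes and similar symbols in the posts would survive this step.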

I want to separate the bloggers into two categories based on their scores. I don't want to say these categories are "good" and "bad" bloggers, because I'm sure they're all good, so I will call them "highly ranked" and "lower ranked". 

In [None]:
import matplotlib.pyplot as plt
plt.style.use('default')
plt.plot(authors.iloc[:,2].values, 'o')
plt.xlabel("Index")
plt.ylabel("Blogger Score")
plt.axhline(30, c = 'r')
plt.text(60, 60, "Highly ranked bloggers")
plt.text(60, 22, "Lower ranked bloggers")

I'll sort the bloggers into each of these two categories.

In [None]:
data['category'] = data['score'].apply(lambda x: 0 if x < 30 else 1)

In [None]:
data.head()
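
One thing worth checking here: if a post's author has no entry in the score lookup, `Series.map` leaves a NaN, and `NaN < 30` is False, so that post would quietly be labeled 1. The sketch below mimics this in plain Python with hypothetical author names and scores (they are assumptions for illustration only).

```python
import math

# Hypothetical lookup table and post authors, for illustration only.
author_score = {"alice": 45.0, "bob": 12.0}
post_authors = ["alice", "bob", "carol"]  # "carol" has no score

# Mimic Series.map: unmatched keys become NaN.
scores = [author_score.get(name, float("nan")) for name in post_authors]

# Same rule as the lambda above: below 30 -> 0, otherwise 1.
categories = [0 if s < 30 else 1 for s in scores]
print(categories)  # [1, 0, 1]  <- "carol" silently lands in class 1

# Flag posts whose author has no score so they can be dropped or inspected.
missing = [name for name, s in zip(post_authors, scores) if math.isnan(s)]
print(missing)  # ['carol']
```

On the real dataframe, `data['score'].isna().sum()` would show whether this happens at all.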

In [None]:
from sklearn.model_selection import train_test_split


train, test = train_test_split(data)
train, val = train_test_split(train)
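
Called without arguments, `train_test_split` holds out 25% of the rows, so splitting twice leaves roughly 56% for training, 19% for validation, and 25% for testing. The sketch below works through that arithmetic. The helper `split_sizes` is hypothetical, and rounding the held-out part up is my understanding of scikit-learn's behavior, so treat the exact counts as approximate.

```python
import math

def split_sizes(n_rows, test_fraction=0.25):
    """Return (train, val, test) sizes for two successive default splits."""
    n_test = math.ceil(test_fraction * n_rows)   # first split: hold out the test set
    n_rest = n_rows - n_test
    n_val = math.ceil(test_fraction * n_rest)    # second split: hold out validation
    n_train = n_rest - n_val
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print(sizes)  # (562, 188, 250)
```

Passing `random_state` to `train_test_split` would make the split reproducible between runs.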

To process the text data, I will use the Keras TextVectorization layer, which turns each post into a fixed-length sequence of integer token indices. For more information, see: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization

In [None]:
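import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers  # use the Keras bundled with TensorFlow, not the standalone package
# In newer TensorFlow releases this layer is also available as tf.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_tokens = 10000  # vocabulary size kept by the vectorization layer
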
vectorize_layer = TextVectorization(
    max_tokens=max_tokens,
    output_sequence_length=200,
)

vectorize_layer.adapt(data.posts.values)
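
To show roughly what the vectorization layer does, here is a simplified pure-Python imitation: build a vocabulary of the most frequent words, map each word to an integer, and pad or truncate to a fixed length. This is a sketch of the idea, not the Keras implementation. The function names and toy posts are my own assumptions, and I'm assuming index 0 is padding and index 1 is the unknown-word token.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding, 1 is unknown."""
    counts = Counter(word for text in texts for word in text.split())
    # Reserve two slots for the padding and out-of-vocabulary tokens.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Map words to integers, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Toy posts, for illustration only.
toy_posts = ["apple ships new phone", "google launches new search", "new phone review"]
vocab = build_vocab(toy_posts, max_tokens=4)
encoded = vectorize("new phone from apple", vocab, sequence_length=6)
print(encoded)  # [2, 3, 1, 1, 0, 0]
```

With `max_tokens=4` only "new" and "phone" make it into the vocabulary, so "from" and "apple" both collapse to the unknown index.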

Now, I will use KerasTuner, a great tool for hyperparameter tuning, to help me find the best model configuration. For more information about KerasTuner, see: https://keras-team.github.io/keras-tuner/

In [None]:
import kerastuner as kt
def build_model(hp):
  # Each hyperparameter gets its own name; reusing one name makes KerasTuner share a single value
  model = keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    layers.Embedding(max_tokens + 1, hp.Int('embedding_dim', min_value = 32, max_value = 512, step = 32)),
    layers.Bidirectional(layers.LSTM(hp.Int('lstm_units', min_value = 32, max_value = 512, step = 32))),
    layers.Dense(hp.Int('dense_units', min_value = 32, max_value = 512, step = 32), activation = 'relu'),
    layers.Dense(1, activation = 'sigmoid')  # probability of the "highly ranked" class
  ])
  hp_learning_rate = hp.Choice('learning_rate', values = [1e-2, 1e-3, 1e-4])

  model.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(learning_rate = hp_learning_rate),
                metrics = ['accuracy'])
  return model
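
The size of this search space depends on whether the three layer widths are tuned together or separately. As far as I know, KerasTuner identifies hyperparameters by name, so repeated `hp.Int` calls with the same name return one shared value. The sketch below just counts the combinations in each case; the variable names are mine.

```python
unit_options = list(range(32, 512 + 1, 32))   # 32, 64, ..., 512
learning_rates = [1e-2, 1e-3, 1e-4]

n_unit_options = len(unit_options)
print(n_unit_options)  # 16

# One shared width for the embedding, LSTM, and dense layers:
shared_space = n_unit_options * len(learning_rates)
print(shared_space)  # 48

# Three independently tuned widths:
independent_space = n_unit_options ** 3 * len(learning_rates)
print(independent_space)  # 12288
```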

In [None]:
tuner = kt.Hyperband(build_model,
                     objective = 'val_accuracy',
                     max_epochs = 10,
                     factor = 3,
                     project_name = 'binary_ranking')  # own project name so another tuner does not reload these trials
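
Hyperband is built on successive halving: train many configurations for a few epochs, keep the best fraction, and give the survivors more epochs. The sketch below shows one such round-by-round elimination in plain Python. It is a simplified illustration, not KerasTuner's exact schedule, and `toy_score` is a hypothetical stand-in for validation accuracy.

```python
def successive_halving(configs, score_fn, min_epochs=1, max_epochs=10, factor=3):
    """Score all configs briefly, keep the top 1/factor, repeat with more epochs."""
    survivors = list(configs)
    epochs = min_epochs
    history = []  # (epochs used, number of configs evaluated) per round
    while True:
        scored = sorted(survivors, key=lambda c: score_fn(c, epochs), reverse=True)
        history.append((epochs, len(scored)))
        if len(scored) == 1 or epochs >= max_epochs:
            return scored[0], history
        survivors = scored[:max(1, len(scored) // factor)]
        epochs = min(max_epochs, epochs * factor)

# Hypothetical stand-in for "validation accuracy after training":
# pretends that 256 units is ideal and that more epochs help slightly.
def toy_score(units, epochs):
    return -abs(units - 256) + 0.01 * epochs

candidates = list(range(32, 513, 32))
best, history = successive_halving(candidates, toy_score)
print(best)     # 256
print(history)  # [(1, 16), (3, 5), (9, 1)]
```

The appeal of this scheme is that most of the training budget goes to configurations that already looked promising after a short run.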

Now, the tuner object searches for the optimal hyperparameters.

In [None]:
import IPython

class ClearTrainingOutput(tf.keras.callbacks.Callback):
  # Clear the cell output after each trial so the notebook stays readable
  def on_train_end(self, logs = None):
    IPython.display.clear_output(wait = True)

tuner.search(train.posts.values, train.category.values, 
          epochs = 10,
          verbose = 0,
          validation_data = (val.posts.values, val.category.values),
          callbacks = [ClearTrainingOutput()])


And now, we can just get the best model from the tuner and train it on our data.

In [None]:
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]
model = tuner.hypermodel.build(best_hps)
model.fit(train.posts.values, train.category.values, 
          epochs = 10,
          verbose = 2,
          validation_data = (val.posts.values, val.category.values))

In [None]:
model.evaluate(test.posts.values, test.category.values)

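While the model isn't very accurate, it suggests that there is a relationship between a blogger's posts and whether they're highly ranked. That reading only holds if the test accuracy beats what you'd get by always guessing the more common category. So, maybe their posts really are just better.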
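
A quick way to put the test accuracy in context is to compare it with a majority-class baseline, meaning the accuracy of always predicting the most common label. Here is a minimal sketch; the helper name and the 70/30 label mix are made up for illustration and are not the real class balance.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Hypothetical label distribution, for illustration only.
toy_labels = [1] * 70 + [0] * 30
baseline = majority_baseline(toy_labels)
print(baseline)  # 0.7
```

On the notebook's data this could be called as `majority_baseline(test.category.values)` and compared with the accuracy reported by `model.evaluate`.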

But just splitting them into two categories might not be enough. There's a big spread in the blogger scores. What if we separated them into three categories: say, good, better, and best?

In [None]:
plt.plot(authors.iloc[:,2].values, 'o')
plt.xlabel("Index")
plt.ylabel("Blogger Score")
plt.axhline(30, c = 'r')
plt.text(60, 60, "Best")
plt.text(22, 22, "Better")
plt.axhline(8, c = 'r')
plt.text(-3, 0, "Good")

In [None]:
def get_score_cat(score):
    if score < 8:
        return 0
    if score < 30:
        return 1
    return 2

In [None]:
data['new_score_category'] = data['score'].apply(get_score_cat)
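
If I wanted more than three categories later, chaining `if` statements would get clumsy. One alternative is to keep the cut points in a sorted list and let `bisect` find the bin. This is a sketch with a hypothetical helper name, `bin_score`; with thresholds 8 and 30 it gives the same labels as the function above.

```python
from bisect import bisect_right

def bin_score(score, thresholds=(8, 30)):
    """Return the index of the bin a score falls into; thresholds must be sorted."""
    return bisect_right(thresholds, score)

binned = [bin_score(s) for s in (3, 8, 29.9, 30, 75)]
print(binned)  # [0, 1, 1, 2, 2]
```

Adding a category then only means adding a number to `thresholds`.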

In [None]:
train, test = train_test_split(data)
train, val = train_test_split(train)

In [None]:
def build_model_2(hp):
  # Each hyperparameter gets its own name; reusing one name makes KerasTuner share a single value
  model = keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    layers.Embedding(max_tokens + 1, hp.Int('embedding_dim', min_value = 32, max_value = 512, step = 32)),
    layers.Bidirectional(layers.LSTM(hp.Int('lstm_units', min_value = 32, max_value = 512, step = 32))),
    layers.Dense(hp.Int('dense_units', min_value = 32, max_value = 512, step = 32), activation = 'relu'),
    layers.Dense(3, activation = 'softmax')  # one probability per score category
  ])
  hp_learning_rate = hp.Choice('learning_rate', values = [1e-2, 1e-3, 1e-4])

  model.compile(loss = 'sparse_categorical_crossentropy',
                optimizer = tf.keras.optimizers.Adam(learning_rate = hp_learning_rate),
                metrics = ['accuracy'])
  return model

In [None]:
tuner = kt.Hyperband(build_model_2,
                     objective = 'val_accuracy',
                     max_epochs = 10,
                     factor = 3,
                     project_name = 'three_class_ranking')  # own project name so earlier tuner results are not reloaded

In [None]:
tuner.search(train.posts.values, train.new_score_category.values, 
          epochs = 10,
          verbose = 0,
          validation_data = (val.posts.values, val.new_score_category.values),
          callbacks = [ClearTrainingOutput()])

In [None]:
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]
model = tuner.hypermodel.build(best_hps)
model.fit(train.posts.values, train.new_score_category.values, 
          epochs = 10,
          verbose = 2,
          validation_data = (val.posts.values, val.new_score_category.values))

In [None]:
model.evaluate(test.posts.values, test.new_score_category.values)
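
`model.evaluate` reports the loss and the overall accuracy, which can hide a class the model never predicts. The sketch below turns rows of softmax probabilities into class labels and then computes accuracy per true class. The helper names and the toy numbers are assumptions for illustration, not output from the model above.

```python
def predicted_classes(probabilities):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

def per_class_accuracy(y_true, y_pred, n_classes=3):
    """Fraction of examples of each true class that were predicted correctly."""
    result = {}
    for cls in range(n_classes):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        if indices:  # skip classes that do not occur in y_true
            result[cls] = sum(y_pred[i] == cls for i in indices) / len(indices)
    return result

# Hypothetical softmax outputs and true labels, for illustration only.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
toy_true = [0, 2, 1, 1]
toy_pred = predicted_classes(toy_probs)
class_accuracy = per_class_accuracy(toy_true, toy_pred)
print(toy_pred)        # [0, 2, 1, 2]
print(class_accuracy)  # {0: 1.0, 1: 0.5, 2: 1.0}
```

With the real model, the probability rows would come from `model.predict(test.posts.values)`.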

There's a relationship here. It's not a perfect one, and my model could probably be improved, but the results suggest that the text of a blogger's posts can somewhat predict their ranking. There are surely other factors (brands, comments, etc.), but I think it's pretty cool that the actual text matters this much.

Going forward, a really cool thing to do would be to try to write a regression model to predict scores, or try to predict scores in more categories that better represent the data.