# Are stories with numbers in headlines more successful?

## Let's use a dataset of 6k+ articles and some statistics to find out

In this notebook I run a two-sample hypothesis test to check whether the expected number of claps is higher for articles with numbers in their headlines than for articles without.

**Sample 1: Articles with numbers in headlines**  
We will model the number of claps inside this group as **n** i.i.d. (independent and identically distributed) random variables: $X_1, X_2, \dots, X_n$ with expected value $\mu_1$ and variance $\sigma_1^2$, both of which are finite.  
 
**Sample 2: Articles without numbers in headlines**  
We will model the number of claps inside this group as **m** i.i.d. random variables: $Y_1, Y_2, \dots, Y_m$ with expected value $\mu_2$ and variance $\sigma_2^2$, both of which are finite.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from scipy.stats import norm

In [None]:
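def like(x, pattern):
    """Return a boolean array: True where a value in x fully matches pattern."""
    r = re.compile(pattern)
    # Vectorize the per-element match so it can be applied to a whole column
    vlike = np.vectorize(lambda val: bool(r.fullmatch(val)))
    return vlike(x)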

In [None]:
df = pd.read_csv('../input/medium-articles-dataset/medium_data.csv')

In [None]:
df

In [None]:
df['title'].isna().sum()

In [None]:
df['claps'].isna().sum()

Neither the 'title' nor the 'claps' column contains NaNs.

In [None]:
regex = '.*[0-9]+.*'

In [None]:
df_numbers = df.loc[like(df['title'], regex), ['title', 'claps']]

In [None]:
df_not_numbers = df.loc[~like(df['title'], regex), ['title', 'claps']]
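The same split can also be sketched with pandas' built-in string methods instead of a vectorized `re` helper. The small `demo` frame below is made up purely for illustration and is not the Medium dataset. `Series.str.contains` searches anywhere in the string, so the surrounding `.*` is not needed, and it is not affected by the fact that `.` does not match a newline by default.

```python
import pandas as pd

# Hypothetical titles and clap counts, invented for this illustration only.
demo = pd.DataFrame({
    'title': ['7 Tips for Cleaner Code', 'Why I Write',
              'Python 3 in Practice', 'On Reading'],
    'claps': [120, 45, 300, 10],
})

# True where the title contains at least one ASCII digit.
has_number = demo['title'].str.contains(r'[0-9]', regex=True, na=False)

demo_numbers = demo.loc[has_number, ['title', 'claps']]
demo_not_numbers = demo.loc[~has_number, ['title', 'claps']]

print(demo_numbers)
print(demo_not_numbers)
```

Passing `na=False` treats a missing title as "no number", which only matters if the column contains NaNs.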

In [None]:
df_numbers

In [None]:
df_not_numbers

We will consider the following hypotheses:  
  
$$
\begin{cases}
H_0: \mu_1 \leq \mu_2 \\
H_1: \mu_1 \gt \mu_2
\end{cases}
$$

And the following test statistic, which by the central limit theorem is approximately standard normal when $n$ and $m$ are large:  
  
$$
Z = \frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}}}
$$
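As a rough sketch, the statistic above can be wrapped in one small function. The name `one_sided_z_test` is hypothetical, and the code assumes two things: the statistic is evaluated at $\mu_1 - \mu_2 = 0$, and the unknown variances are replaced by the unbiased sample variances. The toy input is only an arithmetic check; five observations per group are far too few for the normal approximation to be trusted.

```python
import numpy as np
from scipy.stats import norm


def one_sided_z_test(x, y):
    """Return (z, p) for H0: mu_1 <= mu_2 against H1: mu_1 > mu_2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    # Unbiased sample variances (ddof=1) stand in for the unknown sigma^2.
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m)
    # Evaluated at mu_1 - mu_2 = 0.
    z = (x.mean() - y.mean()) / se
    # Upper-tail probability of a standard normal.
    p = norm.sf(z)
    return z, p


# Toy data: means 5 and 3, both sample variances 2.5, so the standard error is 1.
z_demo, p_demo = one_sided_z_test([3, 4, 5, 6, 7], [1, 2, 3, 4, 5])
print(z_demo, p_demo)  # z_demo is 2.0, p_demo is about 0.0228
```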

In [None]:
n = len(df_numbers.index)
n

In [None]:
m = len(df_not_numbers.index)
m

In [None]:
x_bar = df_numbers['claps'].values.mean()
x_bar

In [None]:
y_bar = df_not_numbers['claps'].values.mean()
y_bar

In [None]:
# ddof=1 gives the unbiased sample variance (NumPy defaults to ddof=0)
var1 = df_numbers['claps'].values.var(ddof=1)
var1

In [None]:
# ddof=1 gives the unbiased sample variance (NumPy defaults to ddof=0)
var2 = df_not_numbers['claps'].values.var(ddof=1)
var2

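The null hypothesis is composite, so we evaluate the statistic at its boundary, $\mu_1 - \mu_2 = 0$, which is the hardest case to reject. We also replace the unknown variances $\sigma_1^2$ and $\sigma_2^2$ with their sample estimates. So our test statistic is: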

In [None]:
z = (x_bar - y_bar)/np.sqrt(var1/n + var2/m)
z

And our p-value is:

In [None]:
# Upper-tail probability; norm.sf avoids the rounding loss of 1 - norm.cdf(z) for large z
p = norm.sf(z)
p

We got a p-value much smaller than the usual threshold of 0.05. That's good news: we can comfortably reject the null hypothesis.

For a significance level of $\alpha = 0.001$, it follows that $p \approx 0.0009 < \alpha$, and therefore we reject the null hypothesis in favor of the alternative. In plain English, this means: "**The data give strong evidence that stories with numbers in their headlines are expected to have more claps than stories without numbers in headlines**". Put differently, if both groups had the same expected number of claps, a difference at least this large would show up in fewer than 1 in 1,000 samples.
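One way to sanity-check a result like this is to compare the hand-computed statistic with Welch's unequal-variance t-test from SciPy. Welch's test uses the same statistic with `ddof=1` variances but takes the p-value from a t distribution, which is nearly normal for samples this large. The sketch below runs on synthetic, skewed data generated only for illustration, not on the Medium dataset, and it assumes SciPy 1.6 or newer for the `alternative` argument.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for clap counts (not real data).
group_a = rng.exponential(scale=120.0, size=2000)
group_b = rng.exponential(scale=100.0, size=4000)

# Welch's t-test, one-sided: H1 is mean(group_a) > mean(group_b).
result = ttest_ind(group_a, group_b, equal_var=False, alternative='greater')

# The same statistic computed by hand with unbiased sample variances.
se = np.sqrt(group_a.var(ddof=1) / group_a.size
             + group_b.var(ddof=1) / group_b.size)
z_manual = (group_a.mean() - group_b.mean()) / se

print(result.statistic, z_manual, result.pvalue)
```

The two statistics should agree up to floating-point error, so any visible gap would point to a mistake in the manual calculation rather than to a difference between the methods.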