In [2]:
import pandas as pd 
import scipy.stats as stats
import numpy as np
import sys

In [3]:
# Import the project helpers from the local path; fall back to the remote path (when a remote kernel is used)
try:
    sys.path.append('../../scripts')
    import database
    import utils
except ImportError:
    try:
        sys.path.append('./scripts')
        import database
        import utils
    except ImportError as exc:
        raise RuntimeError('Failed to import from both local and remote paths. Program terminated.') from exc

# First Hypothesis
This notebook analyzes the following hypothesis:
 - **Does the length of a tweet have an impact on whether its sentiment is positive or negative?**

In [4]:
db, mongo = database.setup_database()

In [5]:
tweets = pd.DataFrame(list(db.tweets.find()))
tweets

Unnamed: 0,_id,date,flag,ids,target,text,user
0,65da1cbbce9ba6aa1ef39331,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,1467810369,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",_TheSpecialOne_
1,65da1cbbce9ba6aa1ef39332,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,1467810672,0,is upset that he can't update his Facebook by ...,scotthamilton
2,65da1cbbce9ba6aa1ef39333,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,1467810917,0,@Kenichan I dived many times for the ball. Man...,mattycus
3,65da1cbbce9ba6aa1ef39334,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,1467811184,0,my whole body feels itchy and like its on fire,ElleCTF
4,65da1cbbce9ba6aa1ef39335,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,1467811193,0,"@nationwideclass no, it's not behaving at all....",Karoli
5,65da1cbbce9ba6aa1ef39336,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,1467811372,0,@Kwesidei not the whole crew,joy_wolf
6,65da1cbbce9ba6aa1ef39337,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,1467811592,0,Need a hug,mybirch
7,65da1cbbce9ba6aa1ef39338,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,1467811594,0,@LOLTrish hey long time no see! Yes.. Rains a...,coZZ
8,65da1cbbce9ba6aa1ef39339,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,1467811795,0,@Tatiana_K nope they didn't have it,2Hood4Hollywood
9,65da1cbbce9ba6aa1ef3933a,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,1467812025,0,@twittera que me muera ?,mimismo


### Metrics (length)
To perform this analysis, we compute the length (in characters) of each tweet and add it to the dataframe as a new `length` column.

In [6]:
tweets['length'] = tweets['text'].apply(len)

In [7]:
tweets.describe()

Unnamed: 0,ids,target,length
count,1600000.0,1600000.0,1600000.0
mean,1998818000.0,2.0,74.04238
std,193576100.0,2.000001,36.38849
min,1467810000.0,0.0,6.0
25%,1956916000.0,0.0,44.0
50%,2002102000.0,2.0,69.0
75%,2177059000.0,4.0,104.0
max,2329206000.0,4.0,359.0


# Correlation Test
We use Spearman's rank correlation coefficient to determine whether the sentiment label (`target`) is positively or negatively correlated with the length of the tweet. We also inspect the p-value to assess whether that correlation is statistically significant.

In [8]:
# Spearman rank correlation between the sentiment label and the tweet length
spearman_r_stat, p_value = stats.spearmanr(tweets['target'], tweets['length'])
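
As a side note on what the coefficient measures: Spearman's coefficient is the Pearson correlation computed on the ranks of the data, with tied values receiving their average rank. The sketch below checks this on synthetic data; the arrays `labels` and `lengths` are made up for illustration and are not the tweets dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (assumption): labels in {0, 4} and arbitrary integer lengths
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000) * 4
lengths = rng.integers(6, 140, size=1000)

# Spearman's coefficient as returned by SciPy
rho, _ = stats.spearmanr(labels, lengths)

# Pearson correlation of the average ranks gives the same value
r_on_ranks, _ = stats.pearsonr(stats.rankdata(labels), stats.rankdata(lengths))

print(np.isclose(rho, r_on_ranks))
```

Because only ranks matter, the coefficient is unaffected by any monotonic rescaling of `length` or by the choice of 0 and 4 as label values.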

# Interpretation of the Results
The coefficient is close to zero, which indicates a very weak correlation. The p-value is nevertheless statistically significant, since it is less than 0.05.

In [9]:
print(f'Spearman correlation coefficient: {spearman_r_stat:.4f} p-value: {p_value}')

if p_value < 0.05:
    print('The correlation is significant')
else:
    print('Do not reject the null hypothesis')

Spearman correlation coefficient: -0.0057 p-value: 5.86891864637797e-13
The correlation is significant
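
To see why such a small coefficient can still be significant, it may help to look at how the p-value depends on the sample size. The sketch below uses the usual t-distribution approximation for a correlation coefficient, with t = r * sqrt((n - 2) / (1 - r^2)) and n - 2 degrees of freedom. The helper name `correlation_p_value` and the smaller sample sizes are illustrative assumptions, and the coefficient is the rounded value printed above, so the result for the full sample only approximates the reported p-value.

```python
import numpy as np
from scipy import stats

def correlation_p_value(r, n):
    """Two-sided p-value for a correlation r observed on n samples (t approximation)."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

r = -0.0057  # rounded coefficient from the output above
for n in (1_000, 100_000, 1_600_000):  # the first two sizes are hypothetical
    print(f'n = {n:>9,}: p-value = {correlation_p_value(r, n):.3g}')
```

With the same coefficient, the smaller hypothetical samples do not reach the 0.05 threshold, while the full 1.6 million tweets do. The p-value therefore says more about the sample size than about the strength of the relationship.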


# Effect Size
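To avoid the significance fallacy (treating a statistically significant result as a practically important one), we compute Cohen's d to measure the effect size between the two sentiment groups (`target == 0` and `target == 4`).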

In [10]:
group1 = tweets[tweets['target'] == 0]['length']
group2 = tweets[tweets['target'] == 4]['length']
cohen_d = utils.cohen_d(group1, group2)
print(f"Cohen's d: {cohen_d:.4f}")

Cohen's d: 0.0126
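
The implementation of `utils.cohen_d` is not shown in this notebook. As a rough sketch, assuming it follows the common pooled-standard-deviation definition of Cohen's d, the computation could look like the following; the function name `cohen_d_sketch` and the sample values are hypothetical.

```python
import numpy as np

def cohen_d_sketch(a, b):
    """Cohen's d with a pooled standard deviation (assumed definition, not necessarily utils.cohen_d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Difference of the group means expressed in pooled standard deviations
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy data (assumption): two small groups whose means differ by 1
d = cohen_d_sketch([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"Cohen's d: {d:.4f}")
```

The sign of d depends only on the order of the two groups, so its magnitude is what gets compared against the thresholds.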


# Interpretation of Cohen's d:
 - Cohen's d values around 0.2 are considered small effect sizes.
 - Values around 0.5 represent medium effect sizes.
 - Values of 0.8 or higher indicate large effect sizes.

The effect size obtained (0.0126) is close to zero, well below the threshold for even a small effect. There is therefore very little difference between the mean tweet lengths of the two groups relative to the variability within the groups. We can infer that the difference is unlikely to be practically significant, even though it is statistically significant because of the large sample size.

In [11]:
mongo.close()