# Analyzing data science articles

This notebook recreates the study presented [here](https://medium.com/the-mission/this-new-data-will-make-you-rethink-how-you-write-headlines-751358f6639a), focusing on data science articles.

The data for this notebook can be found [here](https://www.kaggle.com/viniciuslambert/medium-data-science-articles-dataset).


## Possible questions:

- How do headlines influence popularity?
- Does reading time influence popularity?
- Does the author matter?
- Is there a best day to post?


## CRISP-DM

- Business Understanding
- Data Understanding
- Prepare Data
- Model Data
- Result 
- Deploy

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.set_option('display.max_colwidth', None)

df = pd.read_csv('medium-data-science-articles-2020.csv')
df.head()

Unnamed: 0,url,title,author,author_page,subtitle,claps,responses,reading_time,tag,date
0,https://towardsdatascience.com/making-python-programs-blazingly-fast-c1cd79bd1b32,Making Python Programs Blazingly Fast,martin.heinz,https://towardsdatascience.com/@martin.heinz,Let’s look at the performance of our Python programs and see how…,3300.0,3,5,Data Science,2020-01-01
1,https://towardsdatascience.com/how-to-be-fancy-with-python-8e4c53f47789,How to be fancy with Python,dipam44,https://towardsdatascience.com/@dipam44,Python tricks that will make your life easier,1700.0,12,5,Data Science,2020-01-01
2,https://uxdesign.cc/how-exactly-do-you-find-insights-from-qualitative-user-research-603bcafbc8b3,How exactly do you find insights from qualitative user research?,taylornguyen144,https://uxdesign.cc/@taylornguyen144,Visualizing the synthesis processes…,1100.0,3,4,Data Science,2020-01-01
3,https://towardsdatascience.com/from-scratch-to-search-playing-with-your-data-elasticsearch-ingest-pipelines-6d054bf5d866,From scratch to search: playing with your data (Elasticsearch Ingest Pipelines),stanislavprihoda,https://towardsdatascience.com/@stanislavprihoda,One Pipeline to rule…,232.0,1,9,Data Science,2020-01-01
4,https://www.cantorsparadise.com/the-waiting-paradox-an-intro-to-probability-distributions-97c0aedb8c1,The Waiting Paradox: An Intro to Probability Distributions,maikeelisa,https://www.cantorsparadise.com/@maikeelisa,How much longer do I have to wait for my…,859.0,5,8,Data Science,2020-01-01


In [27]:
df.shape

(108021, 10)

## Find duplicated URLs and drop them

In [31]:
print(df.url.duplicated().sum())
df[df.url.duplicated(keep=False)]

50


Unnamed: 0,url,title,author,author_page,subtitle,claps,responses,reading_time,tag,date
512,https://towardsdatascience.com/top-10-technology-trends-for-2020-4a179fdd53b1,Top 10 Technology Trends for 2020,ryanraiker,https://towardsdatascience.com/@ryanraiker,Strategies and things that will change the way we think and work,3100.0,12,10,Data Science,2020-01-03
4211,https://medium.com/@mike-meyer/redefining-our-information-as-wealth-a118388a7992,Redefining Our Information as Wealth,mike-meyer,https://medium.com/@mike-meyer,How information as assets will correct our economic distortions,87.0,0,6,Data,2020-01-16
4469,https://medium.com/@mike-meyer/redefining-our-information-as-wealth-a118388a7992,I know that Umair has complained a lot about how self-identity was considered the important issue,Aelle1,https://medium.com/@Aelle1,"I think special consideration needs to be made about large quantities of data, data that is too big to be…",5.0,0,2,Data,2020-01-17
6235,https://medium.com/@pkwete/predicting-the-wuhan-coronavirus-global-spread-c662bf5c5bb3,Predicting the Wuhan Coronavirus’ Global Spread,pkwete,https://medium.com/@pkwete,,0.0,0,2,Data Science,2020-01-26
10222,https://onezero.medium.com/how-to-find-out-what-google-and-other-big-tech-companies-know-about-you-649fd368d10e,How to Find Out What Google and Other Big Tech Companies Know About You,tomsmith585,https://onezero.medium.com/@tomsmith585,It’s illuminating — and a bit…,4200.0,24,7,Artificial Inteligence,2020-02-11
...,...,...,...,...,...,...,...,...,...,...
100795,https://towardsdatascience.com/python-alone-wont-get-you-a-data-science-job-a780085ac640,The Zero Knowledge Audience,tigerarcades,https://medium.com/@tigerarcades,,1.0,1,2,Data Science,2020-12-06
100823,https://medium.com/technology-hits/introduction-to-technology-hits-7665b8d5e950,I’ll hug Technology Hits,dviggo,https://medium.com/@dviggo,,97.0,1,1,Data Science,2020-12-06
103683,https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,How Fast Is C++ Compared to Python?,tamimi-naser,https://towardsdatascience.com/@tamimi-naser,An example for data scientists who believe they don’t need to know C++,2000.0,58,4,Data Science,2020-12-16
104448,https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,Generate all the combinations (k-mers) of four nucleotides in C++,razvnpp,https://medium.com/@razvnpp,The story is my reply to the original article here:https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,0.0,0,1,Data Science,2020-12-18


In [30]:
df[df.url == 'https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7']

Unnamed: 0,url,title,author,author_page,subtitle,claps,responses,reading_time,tag,date
103683,https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,How Fast Is C++ Compared to Python?,tamimi-naser,https://towardsdatascience.com/@tamimi-naser,An example for data scientists who believe they don’t need to know C++,2000.0,58,4,Data Science,2020-12-16
104448,https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,Generate all the combinations (k-mers) of four nucleotides in C++,razvnpp,https://medium.com/@razvnpp,The story is my reply to the original article here:https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7,0.0,0,1,Data Science,2020-12-18


In [36]:
# After inspecting the duplicated rows, I concluded that the first occurrence
# of each URL is always the correct one, so let's keep it.

df = df.drop_duplicates(subset=['url'], keep='first')
print(df.url.duplicated().sum())
df.shape

0


(107971, 10)
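
The effect of `keep='first'` can be checked on a tiny frame. This is a minimal sketch with hypothetical rows (the `example.com` URLs and titles are made up), mimicking the case seen above where a reply is stored under the URL of the article it responds to:

```python
import pandas as pd

# Hypothetical toy rows: the second row reuses the URL of the first,
# as when a reply is recorded under the original article's URL.
toy = pd.DataFrame({
    "url": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
    "title": ["Original article", "Reply to the article", "Another article"],
})

# duplicated() flags every repeat after the first occurrence,
# while keep=False flags all rows that share a URL.
n_repeats = int(toy.url.duplicated().sum())
n_involved = int(toy.url.duplicated(keep=False).sum())

# keep='first' retains the earliest row for each URL.
deduped = toy.drop_duplicates(subset=["url"], keep="first")
print(n_repeats, n_involved, deduped.title.tolist())
```

This only works as intended if the rows are ordered so that the original article comes before its replies, which is what the inspection above suggested for this dataset.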

## Get most popular words in titles

### Cleaning the data

In [45]:
titles = df.title.tolist()
titles

['Making Python Programs Blazingly Fast',
 'How to be fancy with\xa0Python',
 'How exactly do you find insights from qualitative user research?',
 'From scratch to search: playing with your data (Elasticsearch Ingest Pipelines)',
 'The Waiting Paradox: An Intro to Probability Distributions',
 'Sentiment Analysis of Movie Reviews in NLTK\xa0Python',
 'How to Write Scripts That Check Data Quality For\xa0You',
 'Gradient Based Optimizations: Jacobians, Jababians &\xa0Hessians',
 '[Time Series Forecast] Anomaly detection with Facebook\xa0Prophet',
 'Decision Trees for\xa0Dummies',
 'Kaggle User Survey\xa02019',
 'Intuition behind Naive Bayes algorithm & Laplace(Additive) smoothing',
 'Visualizing marginal effects using ggeffects in\xa0R',
 'The Decade of Data\xa0Science',
 'Smallest Neural Network for Beginners',
 'Why PadhAI?',
 'Linear Algebra Data Structures and Operations',
 'How To Manage Data Science Project In\xa02020',
 'Useful Probability Distributions and Structured Probabilistic

As you can see, some titles contain confusing characters (such as the non-breaking space `\xa0`), so we need to normalize the data.

In [63]:

import re
import unicodedata

normalized_titles = []
for title in titles:
    title = unicodedata.normalize("NFKD", title)  # normalize so '\xa0' becomes a plain space
    title = re.sub('<[^>]+>', '', title)  # remove anything between <> (HTML noise)
    normalized_titles.append(title)
    
normalized_titles

['Making Python Programs Blazingly Fast',
 'How to be fancy with Python',
 'How exactly do you find insights from qualitative user research?',
 'From scratch to search: playing with your data (Elasticsearch Ingest Pipelines)',
 'The Waiting Paradox: An Intro to Probability Distributions',
 'Sentiment Analysis of Movie Reviews in NLTK Python',
 'How to Write Scripts That Check Data Quality For You',
 'Gradient Based Optimizations: Jacobians, Jababians & Hessians',
 '[Time Series Forecast] Anomaly detection with Facebook Prophet',
 'Decision Trees for Dummies',
 'Kaggle User Survey 2019',
 'Intuition behind Naive Bayes algorithm & Laplace(Additive) smoothing',
 'Visualizing marginal effects using ggeffects in R',
 'The Decade of Data Science',
 'Smallest Neural Network for Beginners',
 'Why PadhAI?',
 'Linear Algebra Data Structures and Operations',
 'How To Manage Data Science Project In 2020',
 'Useful Probability Distributions and Structured Probabilistic Models',
 'A Complete Model D
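
To see what the two cleaning steps do in isolation, here is a small sketch on hand-written strings. The helper name `normalize_title` and the second sample string (with HTML tags) are assumptions made up for illustration, not part of the dataset. Note that NFKD also splits accented letters into a base letter plus a combining mark, which may or may not be desirable for other analyses.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Apply the same two cleaning steps to a single title."""
    # NFKD maps compatibility characters such as '\xa0' (non-breaking space)
    # to their plain equivalents.
    title = unicodedata.normalize("NFKD", title)
    # Drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", title)


cleaned_space = normalize_title("How to be fancy with\xa0Python")
cleaned_html = normalize_title("<strong>Decision Trees</strong> for\xa0Dummies")
print(repr(cleaned_space))
print(repr(cleaned_html))
```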

In [65]:
words_count = {}

for title in normalized_titles:
    split_title = title.split(' ')
    # slide a three-word window over the title;
    # titles with fewer than three words produce no groups
    for i in range(len(split_title) - 2):
        word_group = ' '.join(split_title[i:i + 3])
        # start at 1 on the first appearance, then add 1 for each repeat
        words_count[word_group] = words_count.get(word_group, 0) + 1
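
The same three-word counting can also be expressed with `collections.Counter`. The following is a minimal sketch rather than the notebook's own code: the function name `count_trigrams` and the sample titles are illustrative assumptions.

```python
from collections import Counter


def count_trigrams(titles):
    """Count every run of three consecutive words across the given titles."""
    counts = Counter()
    for title in titles:
        words = title.split(" ")
        # range() is empty when a title has fewer than three words.
        counts.update(" ".join(words[i:i + 3]) for i in range(len(words) - 2))
    return counts


# Illustrative titles, not the full dataset.
sample_titles = [
    "How to be fancy with Python",
    "How to be a data scientist",
    "Why PadhAI?",
]
trigram_counts = count_trigrams(sample_titles)
print(trigram_counts.most_common(1))
```

`Counter.most_common(n)` returns the `n` most frequent groups directly, which avoids a separate sorting step.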

In [66]:
sorted(words_count.items(), key=lambda x: x[1])

{'Making Python Programs': 1,
 'Python Programs Blazingly': 1,
 'Programs Blazingly Fast': 1,
 'How to be': 33,
 'to be fancy': 2,
 'be fancy with': 2,
 'fancy with Python': 1,
 'How exactly do': 1,
 'exactly do you': 1,
 'do you find': 1,
 'you find insights': 1,
 'find insights from': 1,
 'insights from qualitative': 1,
 'from qualitative user': 1,
 'qualitative user research?': 1,
 'From scratch to': 1,
 'scratch to search:': 1,
 'to search: playing': 1,
 'search: playing with': 1,
 'playing with your': 1,
 'with your data': 3,
 'your data (Elasticsearch': 1,
 'data (Elasticsearch Ingest': 1,
 '(Elasticsearch Ingest Pipelines)': 1,
 'The Waiting Paradox:': 1,
 'Waiting Paradox: An': 1,
 'Paradox: An Intro': 1,
 'An Intro to': 17,
 'Intro to Probability': 1,
 'to Probability Distributions': 1,
 'Sentiment Analysis of': 34,
 'Analysis of Movie': 2,
 'of Movie Reviews': 1,
 'Movie Reviews in': 1,
 'Reviews in NLTK': 1,
 'in NLTK Python': 1,
 'How to Write': 20,
 'to Write Scripts': 1,
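
Since `sorted` orders ascending by default, the most frequent word groups end up at the end of the result. A minimal sketch of listing the most popular groups first, using a few counts copied from the output above (the name `sample_counts` is illustrative):

```python
# A handful of counts taken from the output shown above.
sample_counts = {
    "Making Python Programs": 1,
    "How to be": 33,
    "An Intro to": 17,
    "Sentiment Analysis of": 34,
    "How to Write": 20,
}

# reverse=True puts the highest counts first.
top_groups = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)

for word_group, count in top_groups[:3]:
    print(f"{count:>3}  {word_group}")
```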
