# Internet News and Consumer Engagement
This dataset contains news articles published between early September and early November 2019. It is enriched with Facebook engagement data, such as the number of shares, comments, and reactions. The dataset was originally created to predict an article's popularity before publication; however, there is a lot more you can analyze!

Not sure where to begin? Scroll to the bottom to find challenges!

In [None]:
import pandas as pd

df = pd.read_csv("news_articles.csv", index_col=0)

print(df.head())

            source_id         source_name              author  \
0             reuters             Reuters   Reuters Editorial   
1     the-irish-times     The Irish Times  Eoin Burke-Kennedy   
2     the-irish-times     The Irish Times   Deirdre McQuillan   
3  al-jazeera-english  Al Jazeera English          Al Jazeera   
4            bbc-news            BBC News            BBC News   

                                               title  \
0  NTSB says Autopilot engaged in 2018 California...   
1       Unemployment falls to post-crash low of 5.2%   
2  Louise Kennedy AW2019: Long coats, sparkling t...   
3  North Korean footballer Han joins Italian gian...   
4  UK government lawyer says proroguing parliamen...   

                                         description  \
0  The National Transportation Safety Board said ...   
1  Latest monthly figures reflect continued growt...   
2  Autumn-winter collection features designer’s g...   
3  Han is the first North Korean player in the S

## Data dictionary

|    | Variable                        | Description                                                                  |
|---:|:--------------------------------|:-----------------------------------------------------------------------------|
|  0 | source_id                       | publisher unique identifier                                                  |
|  1 | source_name                     | human-readable publisher name                                                |
|  2 | author                          | article author                                                               |
|  3 | title                           | article headline                                                             |
|  4 | description                     | article short description                                                    |
|  5 | url                             | article URL from publisher website                                           |
|  6 | url_to_image                    | URL of the main image associated with the article                            |
|  7 | published_at                    | exact time and date of publishing the article                                |
|  8 | content                         | unformatted content of the article truncated to 260 characters               |
|  9 | top_article                     | value indicating if article was listed as a top article on publisher website |
| 10 | engagement_reaction_count       | count of user reactions on Facebook posts involving the article URL          |
| 11 | engagement_comment_count        | count of user comments on Facebook posts involving the article URL           |
| 12 | engagement_share_count          | count of user shares on Facebook posts involving the article URL             |
| 13 | engagement_comment_plugin_count | count of user comments in the Facebook comment plugin on the article website |

[Source](https://www.kaggle.com/szymonjanowski/internet-articles-data-with-users-engagement) of dataset.

## Don't know where to start? 

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: Which publishers and authors publish the most content in this dataset? Which publish the most engaging content?
- 📊 **Visualize**: Create two word clouds, one for the titles and one for the descriptions of the articles, to find the most popular words. Make sure to remove stop words!
- 🔎 **Analyze**: On days when total engagement was higher than usual, can you identify a common event or theme from the article text?
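
As a starting point for the Analyze challenge, here is a minimal sketch of one way to flag high-engagement days. The function name `high_engagement_days`, the `n_std` parameter, and the rule "daily total above mean plus `n_std` standard deviations" are assumptions made for illustration; the dataset does not define what "higher than usual" means, so other thresholds (for example a percentile) may suit it better.

```python
import pandas as pd

# The four engagement columns listed in the data dictionary.
ENGAGEMENT_COLUMNS = [
    "engagement_reaction_count",
    "engagement_comment_count",
    "engagement_share_count",
    "engagement_comment_plugin_count",
]


def high_engagement_days(articles: pd.DataFrame, n_std: float = 1.0) -> pd.Series:
    """Return daily engagement totals above mean + n_std * std (an assumed rule)."""
    # Parse timestamps as UTC; values that cannot be parsed become NaT
    # and are left out of the grouping.
    published = pd.to_datetime(articles["published_at"], errors="coerce", utc=True)
    # Row-wise engagement total per article (missing counts are skipped).
    per_article = articles[ENGAGEMENT_COLUMNS].sum(axis=1)
    # Sum engagement per calendar day (UTC midnight as the day key).
    daily = per_article.groupby(published.dt.normalize()).sum()
    threshold = daily.mean() + n_std * daily.std()
    return daily[daily > threshold].sort_values(ascending=False)
```

Calling `high_engagement_days(df)` returns the flagged days, largest first; the titles and descriptions published on those days can then be inspected for a shared event or theme.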

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

You have a friend who works as a reporter for BBC News. He's been disappointed by his articles' low Facebook engagement and by the fact that none of them has ever been listed as a top article on the BBC. You've offered to help by finding data-driven recommendations on how he should position his articles (such as guidelines on title and description) and at what time of day he should publish them. He's interested in what makes a top article at the BBC and what gets the most Facebook engagement.
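
One possible first step for the timing question is to profile a single publisher by publishing hour. The sketch below is illustrative only: `hourly_profile` is a hypothetical helper, it assumes a `total engagement` column has already been computed as the row-wise sum of the four engagement counts, it assumes `top_article` holds numeric 0/1 values, and it reports hours in UTC rather than the publisher's local time.

```python
import pandas as pd


def hourly_profile(articles: pd.DataFrame, source_id: str,
                   engagement_col: str = "total engagement") -> pd.DataFrame:
    """Summarize one publisher's articles by publishing hour (UTC)."""
    # Keep only the requested publisher; copy so the caller's frame is untouched.
    subset = articles[articles["source_id"] == source_id].copy()
    # Extract the hour of publication; unparseable timestamps become NaN
    # and are dropped by groupby.
    subset["hour"] = pd.to_datetime(
        subset["published_at"], errors="coerce", utc=True
    ).dt.hour
    return subset.groupby("hour").agg(
        n_articles=(engagement_col, "count"),
        median_engagement=(engagement_col, "median"),
        # Mean of a 0/1 flag is the share of top articles (assumed encoding).
        top_article_rate=("top_article", "mean"),
    )
```

For example, `hourly_profile(df, "bbc-news")` would give one row per hour. The median is used instead of the mean because engagement is heavily skewed, with many articles at or near zero and a few very large values.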

You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.

---

✍️ _If you have an idea for an interesting Scenario or Challenge, or have feedback on our existing ones, let us know! You can submit feedback by pressing the question mark in the top right corner of the screen and selecting "Give Feedback". Include the phrase "Content Feedback" to help us flag it in our system._

In [None]:

# Count articles with non-null content per author (computed once and reused).
content_per_author = df.groupby('author')['content'].count()
highest_content_author = content_per_author.idxmax()
highest_content = content_per_author.max()

print(highest_content_author, ":", highest_content)

The Associated Press : 975


In [None]:
df.columns

Index(['source_id', 'source_name', 'author', 'title', 'description', 'url',
       'url_to_image', 'published_at', 'content', 'top_article',
       'engagement_reaction_count', 'engagement_comment_count',
       'engagement_share_count', 'engagement_comment_plugin_count'],
      dtype='object')

In [None]:
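columns_list = ['engagement_reaction_count', 'engagement_comment_count',
                'engagement_share_count', 'engagement_comment_plugin_count']

# Row-wise sum of the four engagement counts. sum() skips NaN by default,
# so an article with no engagement data gets a total of 0.0.
df['total engagement'] = df[columns_list].sum(axis=1)

print(df['total engagement'].head(50))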

0     2528.0
1       18.0
2        0.0
3        7.0
4        0.0
5        0.0
6      817.0
7        0.0
8       22.0
9      135.0
10      38.0
11      49.0
12      85.0
13     716.0
14       0.0
15       4.0
16      50.0
17       1.0
18       1.0
19       4.0
20      24.0
21      18.0
22       6.0
23       3.0
24       0.0
25      10.0
26     140.0
27       8.0
28     517.0
29      10.0
30    2608.0
31       0.0
32       0.0
33       0.0
34       5.0
35       0.0
36       4.0
37     272.0
38    1824.0
39       4.0
40    5144.0
41      10.0
42       0.0
43       0.0
44    1045.0
45       0.0
46       0.0
47     166.0
48     780.0
49       1.0
Name: total engagement, dtype: float64


In [None]:

# Select the article(s) whose total engagement equals the maximum.
is_max = df['total engagement'] == df['total engagement'].max()
highest_engagement_author = df.loc[is_max, ['author', 'total engagement']]
print(highest_engagement_author)

                              author  total engagement
8500  Elizabeth Wolfe And Brian Ries          434855.0
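
The result above identifies the single most-engaging article, which is not necessarily the author with the most engagement overall. A minimal sketch of an author-level summary follows; `engagement_by_author` and its output column names are hypothetical, and it assumes the `total engagement` column created earlier.

```python
import pandas as pd


def engagement_by_author(articles: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Summarize engagement per author, largest total first."""
    summary = (
        articles.groupby("author")["total engagement"]
        # Named aggregation: output column name = aggregation to apply.
        .agg(n_articles="count", total="sum", median="median")
        .sort_values("total", ascending=False)
    )
    return summary.head(top_n)
```

`engagement_by_author(df)` lists the ten authors with the highest summed engagement. Comparing `total` with `median` helps separate authors who are consistently engaging from those with one viral article.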


In [None]:
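from collections import Counter

# Join all titles into one string and split on whitespace.
# str() turns missing titles (NaN) into the literal token "nan".
words_count_title = " ".join(map(str, df["title"])).split()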

In [None]:
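import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Build the stop-word set once; calling stopwords.words() for every word is slow.
stop_words = set(stopwords.words('english'))
# The comparison is case-sensitive, so capitalized words such as "The" are kept.
A_title = [word for word in words_count_title if word not in stop_words]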

[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
words_counting_title = Counter(A_title).most_common()

print(words_counting_title[0:30])

[('The', 643), ('Trump', 477), ('says', 461), ('-', 347), ('US', 319), ('new', 311), ('U.S.', 309), ('New', 277), ('A', 265), ('Brexit', 215), ('How', 203), ('Dorian', 178), ('Is', 171), ('Hurricane', 166), ('—', 164), ('World', 159), ('could', 142), ('Saudi', 133), ('Johnson', 131), ("Trump's", 131), ('China', 130), ('Hong', 130), ('Man', 122), ('Ireland', 122), ('What', 122), ('first', 121), ('UK', 120), ('2020', 116), ('With', 116), ('Wall', 116)]
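
The counts above still contain capitalized stop words (`'The'`, `'A'`, `'How'`), bare punctuation (`'-'`, `'—'`), and case variants counted separately (`'new'` and `'New'`), because the filter compares raw tokens against a lower-case stop-word list. Below is a minimal sketch of a stricter counter using only the standard library. The helper name `count_words` and the set of characters stripped are assumptions for illustration; lower-casing also merges tokens such as `US` with the pronoun `us`, so it is not always the right choice.

```python
import string
from collections import Counter


def count_words(texts, stop_words, top_n=30):
    """Count lower-cased words after stripping punctuation and stop words."""
    stop_words = {word.lower() for word in stop_words}
    # ASCII punctuation plus the dashes and curly quotes common in headlines.
    strip_chars = string.punctuation + "—–’‘“”"
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):  # skip missing values such as NaN
            continue
        for token in text.lower().split():
            # Remove punctuation from both ends only, so "trump's" stays intact.
            token = token.strip(strip_chars)
            if token and token not in stop_words:
                counts[token] += 1
    return counts.most_common(top_n)
```

With the NLTK list loaded, this could be called as `count_words(df["title"], stopwords.words("english"))`.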


In [None]:
words_count_description = " ".join(map(str, df["description"])).split()
# Build the stop-word set once instead of rebuilding the list for every word.
stop_words = set(stopwords.words('english'))
A_description = [word for word in words_count_description if word not in stop_words]
words_counting_description = Counter(A_description).most_common()

print(words_counting_description[0:30])

[('The', 2331), ('said', 826), ('A', 790), ('new', 703), ('President', 611), ('says', 592), ('news', 575), ('top', 568), ('U.S.', 556), ('Trump', 524), ('national', 515), ('world', 514), ('video', 479), ('online', 466), ('breaking', 427), ('Get', 421), ('broadcast', 418), ('exclusive', 418), ('news.', 418), ('ABC', 416), ('news,', 412), ('Find', 412), ('coverage,', 408), ('interviews.', 408), ('one', 389), ('two', 388), ('first', 379), ('New', 372), ('people', 365), ('could', 315)]
