## Introduction

### News

Crawling: Crawl news and information websites.

Machine learning:

1. Determine whether a given article is fake.
2. Find out what category of news the article falls into.
3. Predict the likelihood that the article will go viral.

### Ecommerce

Crawling: Crawl e-commerce websites.

Machine learning:

1. Determine what colors and designs are in vogue, using image recognition and review sentiment analysis (something I’m sure marketers would kill for).
2. Predict which brands and retailers are doing well (I’ve worked on something similar), something Wall Street could be in the market for.
3. Find product pages across websites that refer to the same product, a key task for price comparison, using RNNs.

### Music

Crawling: Crawl lyrics sites to get lyrics of songs across the web.

Machine learning:

1. Use a generative model to auto-generate your own song lyrics (AI lyricist anybody?).
2. Identify plagiarised songs by detecting common patterns in songs by different artists; build a Song2Vec perhaps?

### Government

Crawling: Pull datasets/data from government websites.

Machine learning:

1. Identify trends in climate change and predict future temperature rises, using regression and more.
2. Predict population migration patterns or population increases using historical data.

### Books

Crawling: Pull the content of publicly available books.

Machine learning:

1. Identify works of writing that may have been plagiarized (Book2Vec?).
2. Auto-generate book summaries using Seq2Seq techniques (this is a bit out there considering the maturity of the technology today, but is perhaps one for the future).

Now let us look at something interesting.

### Online News Crawling and Virality Prediction

Today, we will learn how to crawl news from popular news websites and then predict how likely the crawled articles are to go viral. Read the article till the end.

First, we will crawl the news from a popular news website of your choice. In my case, I crawled the Times of India website.

1. Install the newspaper library from PyPI (newspaper3k is the Python 3 package):

In [11]:

    pip install newspaper3k

2. Import all the necessary libraries:


In [12]:

    import requests
    from bs4 import BeautifulSoup
    from newspaper import Article
    import pandas as pd
    import numpy as np

3. Set the URL of the news website and send the request:

In [15]:

    url = "https://timesofindia.indiatimes.com/world"
    r = requests.get(url)

4. Parse the page with Beautiful Soup and collect all the article links in a variable:

In [16]:

    from urllib.parse import urljoin

    soup = BeautifulSoup(r.content, 'html5lib')  # needs the html5lib package
    # anchors with class 'w_img' (the selector used in this walkthrough)
    table = soup.find_all('a', attrs={'class': 'w_img'})
    # resolve each href against the page URL; step 5 iterates over this list
    news = [urljoin(url, a['href']) for a in table if a.get('href')]

5. Download and parse each article using the newspaper library:

In [17]:

    records = []
    for i in news:
        article = Article(i, language="en")
        article.download()
        article.parse()
        article.nlp()  # fills summary and keywords; needs NLTK's punkt data
        data = {}
        data['Title'] = article.title
        data['Text'] = article.text
        data['Summary'] = article.summary
        data['Keywords'] = article.keywords
        records.append(data)  # keep every article, not only the last one

Now we will build a regression model to predict the virality of news.

1. Download the Online News Popularity dataset from the UCI Machine Learning Repository.
2. Load the data and split it into x_train, y_train, x_test, y_test:

In [18]:

    from sklearn.model_selection import train_test_split

    # FILEPATH is the path to the downloaded CSV; clean_cols is a helper not shown in this post
    full_data = clean_cols(pd.read_csv(FILEPATH))
    train_set, test_set = train_test_split(full_data, test_size=0.20, random_state=42)

3. Drop the non-predictive columns.
4. Train the regression model with a RandomForestRegressor:

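In [19]:

    from sklearn.ensemble import RandomForestRegressor

    # x_train / y_train: features and target taken from train_set after step 3
    clf = RandomForestRegressor(random_state=42)
    clf.fit(x_train, y_train)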
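
Step 3 is described without code, and the split in step 2 yields train_set and test_set while the model is fitted on x_train and y_train. A minimal, dependency-free sketch of that missing step might look like the following. It assumes, based on the dataset's documentation, that url and timedelta are the non-predictive columns and shares is the target, and that the CSV header names may carry leading spaces (presumably what the clean_cols helper deals with). The names load_rows and split_features_target are hypothetical, and the csv module stands in for pandas here.

```python
import csv
import io

# Assumption: 'url' and 'timedelta' are non-predictive and 'shares' is the
# target, as described in the dataset's documentation.
NON_PREDICTIVE = ("url", "timedelta")
TARGET = "shares"


def load_rows(csv_file):
    """Read the CSV and strip stray whitespace from header names and cells."""
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in reader]


def split_features_target(rows):
    """Return (x, y): numeric feature dicts and the matching share counts."""
    x, y = [], []
    for row in rows:
        y.append(float(row[TARGET]))
        x.append({name: float(value) for name, value in row.items()
                  if name not in NON_PREDICTIVE and name != TARGET})
    return x, y


# Tiny inline sample in the same shape as the real file (values are made up).
sample = io.StringIO(
    "url, timedelta, n_tokens_title, n_tokens_content, shares\n"
    "http://example.com/a, 731.0, 12.0, 219.0, 593\n"
    "http://example.com/b, 730.0, 9.0, 255.0, 711\n"
)
x_sample, y_sample = split_features_target(load_rows(sample))
print(x_sample[0])
print(y_sample)
```

With pandas, roughly the same steps are `df.columns = df.columns.str.strip()` followed by `df.drop(columns=['url', 'timedelta', 'shares'])` for the features and `df['shares']` for the target, applied to train_set and test_set separately.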


Now our model is ready. Let’s convert the crawled news into the format the prediction model expects: we cannot simply pass in the raw text, so we have to extract features from each article that help predict its virality.

Features will include:

1. No. of words in title
2. No. of words in content (news article)
3. Rate of unique tokens
4. Rate of non-stop words
5. Rate of non-stop unique words
6. No. of URLs present in content
7. No. of Images in content
8. No. of Videos in content
9. Average word length
10. No. of unique keywords
11. Type of data channel: lifestyle, entertainment, business, technology, social media, world
12. Day of week
13. Subjectivity
14. Sentiment polarity
15. Rate of negative words
16. Rate of positive words
17. Average negative polarity
18. Average positive polarity
19. Minimum positive polarity
20. Maximum positive polarity
21. Maximum negative polarity
22. Minimum negative polarity
23. Title subjectivity
24. Title sentiment polarity
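
As a rough illustration of the text-statistics features in the list above (items 1 to 5, 9 and 10), here is a minimal sketch using only the standard library. The stop-word list is a tiny made-up sample, the tokenizer is a simple regular expression, and the rate definitions are one plausible reading of the feature names, so treat all three as assumptions. The dictionary keys are intended to match the dataset's column names and should be checked against the CSV header; text_features is a hypothetical helper.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real run would use a
# fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def tokenize(text):
    """Lower-case word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def text_features(title, content, keywords):
    """Counts and rates computed from the title, body text and keyword list."""
    tokens = tokenize(content)
    non_stop = [t for t in tokens if t not in STOP_WORDS]
    n = len(tokens)
    return {
        "n_tokens_title": len(tokenize(title)),
        "n_tokens_content": n,
        # share of distinct words among all words
        "n_unique_tokens": len(set(tokens)) / n if n else 0.0,
        # share of words that are not stop words
        "n_non_stop_words": len(non_stop) / n if n else 0.0,
        # share of distinct words among the non-stop words
        "n_non_stop_unique_tokens": (
            len(set(non_stop)) / len(non_stop) if non_stop else 0.0
        ),
        "average_token_length": sum(map(len, tokens)) / n if n else 0.0,
        "num_keywords": len(set(keywords)),
    }


features = text_features(
    title="The cat sat",
    content="The cat sat on the mat",
    keywords=["cat", "mat", "cat"],
)
print(features)
```

In the crawl loop, the title, text and keywords would come from article.title, article.text and article.keywords.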

Append all these features to a list and make a data frame out of it.

Use this data frame to predict the virality of news using our regression model.

Here is an example.
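
The assembly step can be sketched as follows. The model expects one row per article with the same columns, in the same order, as the training data, so the categorical features (data channel and day of week) have to be one-hot encoded. The column names below are modelled on the dataset's header as an assumption and should be checked against the CSV you trained on; one_hot and build_row are hypothetical helpers, and defaulting missing features to 0.0 is a simplification.

```python
from datetime import date

# Assumption: one-hot column names modelled on the dataset's header; check
# them against the CSV you trained on.
CHANNELS = ["lifestyle", "entertainment", "bus", "socmed", "tech", "world"]
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def one_hot(prefix, options, chosen):
    """One 0/1 column per option, with a 1 only for the chosen option."""
    return {f"{prefix}{name}": int(name == chosen) for name in options}


def build_row(numeric_features, channel, published, columns):
    """Merge all features and return them as a list in training-column order."""
    row = dict(numeric_features)
    row.update(one_hot("data_channel_is_", CHANNELS, channel))
    row.update(one_hot("weekday_is_", WEEKDAYS, WEEKDAYS[published.weekday()]))
    # Features that were not computed default to 0.0 so the width always matches.
    return [float(row.get(name, 0.0)) for name in columns]


# Hypothetical subset of the training columns, in training order.
columns = ["n_tokens_title", "n_tokens_content", "data_channel_is_world",
           "data_channel_is_tech", "weekday_is_monday", "weekday_is_friday"]

row = build_row(
    {"n_tokens_title": 9, "n_tokens_content": 255},
    channel="world",
    published=date(2024, 1, 5),  # a Friday
    columns=columns,
)
print(row)  # [9.0, 255.0, 1.0, 0.0, 0.0, 1.0]
```

A list of such rows can then be turned into a data frame with `pd.DataFrame(rows, columns=columns)` and passed to `clf.predict`, provided columns lists every feature the model was trained on.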

### Extracted information

news-please extracts the following attributes from news articles. Also, have a look at an exemplary JSON file extracted by news-please.

1. headline
2. lead paragraph
3. main text
4. main image
5. name(s) of author(s)
6. publication date
7. language

### Features

1. works out of the box: install with pip, add URLs of your pages, run :-)
2. run news-please conveniently using its CLI mode
3. use it as a library within your own software
4. extract articles from commoncrawl.org's news archive

### Modes and use cases

news-please supports three main use cases, which are explained in more detail in the following.

### CLI mode

1. stores extracted results in JSON files, PostgreSQL, ElasticSearch, or your own storage
2. simple but extensive configuration (if you want to tweak the results)
3. revisions: crawl articles multiple times and track changes

### Library mode

1. crawl and extract information given a list of article URLs
2. use news-please within your own Python code

### News archive from commoncrawl.org

1. commoncrawl.org provides an extensive, free-to-use archive of news articles from small and major publishers worldwide
2. news-please enables users to conveniently download and extract articles from commoncrawl.org
3. you can optionally define filter criteria, such as news publisher(s) or the date period within which articles need to be published
4. to get started: clone the news-please repository, install the awscli tool, adapt the config section in newsplease/examples/commoncrawl.py, and execute python3 -m newsplease.examples.commoncrawl
