# Forbes Articles Crawling

The data consists of articles on various topics, such as
[Money](https://www.forbes.com/money/),
[Leadership](https://www.forbes.com/leadership/),
[World's Billionaires](https://www.forbes.com/worlds-billionaires/),
[Business](https://www.forbes.com/business/),
[Small Business](https://www.forbes.com/small-business/),
[Lifestyle](https://www.forbes.com/lifestyle/), and
[Real Estate](https://www.forbes.com/real-estate/),
extracted from the pages of the
[Forbes](https://www.forbes.com) website.

> [Forbes](https://en.wikipedia.org/wiki/Forbes) is an American business magazine owned by Integrated Whale Media Investments and the Forbes family. Published eight times a year, it features articles on finance, industry, investing, and marketing topics. Forbes also reports on related subjects such as technology, communications, science, politics, and law.

The Python library **Scrapy** was used to extract the articles. The following information is extracted for each
article:

| Attribute | Description | Type |
| -- | -- | -- |
|context_header| category (context) of article | String |
|corpus_date_ymd| publication date of article (y/m/d) | String |
|corpus_date_hm| publication time of article (h/m) | String |
|corpus_title| title of article | String |
|corpus_content_parts | paragraphs of article content | List(String) |
|author_var_dict | profile of article author (described below) | Dictionary |

The <code>author_var_dict</code> attribute contains the following fields:

| Attribute | Description | Type |
| -- | -- | -- |
|author_forbes_url| Forbes URL of article author | String |
| author_name | article author name | String |
| author_contrib_type | article author contributor type | String |
| author_subcontext_header | author field | String |
| author_about | a paragraph about article author | String |
| author_social_links | article author social links | List(String) | 
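
For orientation, a single scraped item might look like the sketch below. Only the field names come from the tables above; every value is an invented placeholder, not real scraped data, and the date and time formats shown are assumptions. The paragraph list is named `corpus_content_parts` here to match the loading code later in the notebook.

```python
# Hypothetical example of one scraped item; every value is a placeholder.
sample_item = {
    "context_header": "Money",
    "corpus_date_ymd": "2022/03/30",  # assumed y/m/d format
    "corpus_date_hm": "09:15",        # assumed h:m format
    "corpus_title": "Example Article Title",
    "corpus_content_parts": ["First paragraph.", "Second paragraph."],
    "author_var_dict": {
        "author_forbes_url": "https://www.forbes.com/sites/example-author/",
        "author_name": "Example Author",
        "author_contrib_type": "Contributor",
        "author_subcontext_header": "Markets",
        "author_about": "A short biography of the author.",
        "author_social_links": ["https://twitter.com/example"],
    },
}

# Nested author fields are reached through the author dictionary.
author_name = sample_item["author_var_dict"]["author_name"]
print(author_name)
```

When such items are later flattened with `pd.json_normalize`, nested fields become dotted column names such as `author_var_dict.author_name`.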

**Scrapy** can be installed with the following command:

<code>pip3 install scrapy</code>

The current articles on the Forbes site are then written to <code>file_name.json</code> with the following command (the <code>-O</code> option overwrites the output file if it already exists):

<code>scrapy crawl forbes -O file_name.json</code>

# Import Libraries

In [66]:
import json

import numpy as np
import pandas as pd

import nltk
import spacy

import re
import string
from collections import Counter
from itertools import chain

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


# Aggregate Json Files Data 

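In this section, the JSON files named <code>f{i}.json</code> (<code>f1.json</code>, <code>f2.json</code>, ...) are merged, and duplicate items, identified by title, are removed. Items without an <code>author_var_dict</code> field are also dropped.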

In [31]:
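n = 2  # number of input files: f1.json ... fn.json
data = []
for i in range(n):
    # the context manager closes each file after it has been loaded
    with open(f'f{i + 1}.json') as file_i:
        data.extend(json.load(file_i))

# keep the first item per title; skip items without an author profile
seen_titles = set()
data_unique = []
for item in data:
    title = item['corpus_title']
    if title not in seen_titles and 'author_var_dict' in item:
        seen_titles.add(title)
        data_unique.append(item)

data = data_unique

len(data)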

140
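
A small self-contained illustration of the deduplication rule: the first item with a given title is kept, and items without `author_var_dict` are skipped. The helper name `dedupe_by_title` and the toy records are hypothetical and exist only for this sketch.

```python
def dedupe_by_title(items):
    """Keep the first item per title, skipping items without an author profile."""
    seen_titles = set()
    unique_items = []
    for item in items:
        title = item["corpus_title"]
        if title in seen_titles or "author_var_dict" not in item:
            continue
        seen_titles.add(title)
        unique_items.append(item)
    return unique_items


# Toy records (invented for illustration).
toy_items = [
    {"corpus_title": "A", "author_var_dict": {"author_name": "x"}},
    {"corpus_title": "A", "author_var_dict": {"author_name": "y"}},  # duplicate title
    {"corpus_title": "B"},  # no author profile
    {"corpus_title": "C", "author_var_dict": {"author_name": "z"}},
]

unique = dedupe_by_title(toy_items)
print([item["corpus_title"] for item in unique])  # ['A', 'C']
```

Tracking seen titles in a set keeps each membership check cheap even when many files are merged.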

# Dictionary to Dataframe

In [39]:
df = pd.json_normalize(data)

In [40]:
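# drop share-widget captions that were scraped along with the paragraphs
share_labels = ["Share to Facebook", "Share to Twitter", "Share to Linkedin"]
df["corpus_content_parts"] = df["corpus_content_parts"].apply(
    lambda parts: [part for part in parts if part not in share_labels]
)
# join the remaining paragraphs into one text per article
df["corpus_content"] = df["corpus_content_parts"].apply(lambda parts: " ".join(parts))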


# Some Samples From Contexts

In [42]:
series = df["corpus_content"].apply(lambda x: x.replace("", ""))
samples = df.groupby("context_header").apply(lambda x: x.sample(n=1))["corpus_content"]
pd.DataFrame(samples)

| context_header | index | corpus_content |
| -- | -- | -- |
| Billionaires | 133 | Japanese auto maker Suzuki Motor plans to inve... |
| Business | 53 | With Paramount’s The Lost City earning decen... |
| Leadership | 137 | I nflation may be hitting new 40-year highs ,... |
| Lifestyle | 26 | For hip hop star and fashion icon A$AP Rocky, ... |
| Money | 91 | Key News Asian equities were mixed overnight a... |
| Real Estate | 102 | The Biden administration has established a tas... |
| Small Business | 48 | Most companies want to grow. Often, the very s... |


# Download And Setup Pre-Trained Model

To download and set up the spaCy pre-trained English model, uncomment the following command.

In [43]:
# !python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

# Extract Specifications of Corpus (1)

In [44]:
def extract_basic_specefications_of(text):
    doc = nlp(text)

    # materialize the doc.sents generator so it can be counted and reused
    sentences = list(doc.sents)

    n_all_tokens = len(doc)
    n_sentences = len(sentences)

    # average sentence length in tokens (0 for an empty text)
    average_sentence_len = sum(len(sentence) for sentence in sentences) / n_sentences if n_sentences else 0

    n_puncs = sum(token.is_punct for token in doc)
    n_stopwords = sum(token.is_stop for token in doc)

    # print(f'{n_sentences: <5}{n_all_tokens: <5}{n_puncs: <5}{n_stopwords: <5}')
    return n_sentences, average_sentence_len, n_all_tokens, n_puncs, n_stopwords


stat_columns = ['No Sentences', 'Average Lengths Of Sentences', 'No All Tokens',
                'No Punctuation Tokens', 'No Stopwords']
df[stat_columns] = df.apply(
    lambda item: extract_basic_specefications_of(item['corpus_content']),
    axis=1,
    result_type='expand',
)

# Normalization



For normalization, the following steps are applied, in order, to the lowercased text of each article:
<center>
    $\text{Input Text} \Rightarrow \text{Lemmatize} \Rightarrow \text{Remove Stopwords} \Rightarrow \text{Remove Punctuations} \Rightarrow \text{Output Tokens}$
</center>

In [45]:
def normalize_component(doc: str):
    doc = doc.lower().strip()

    doc = nlp(doc)

    # lemmatization
    r = map(lambda token: token.lemma_, doc)

    # strip surrounding whitespace from each lemma
    r = map(lambda token: token.strip(), r)

    # remove stopwords
    r = list(filter(lambda token: not nlp.vocab[token].is_stop, r))

    # remove tokens that occur in string.punctuation
    # (punctuation characters and empty strings)
    r = list(filter(lambda token: token not in string.punctuation, r))

    return r


df['normalized_corpus_words'] = df['corpus_content'].map(normalize_component)

# Extract Specification of Corpus (2)

In [46]:
def extract_other_specefications_of(words):
    n_words = len(words)
    n_unique_words = len(set(words))

    # guard against articles whose text normalizes to no words
    if n_words == 0:
        return 0, 0, 0.0, 0

    average_word_len = sum(len(word) for word in words) / n_words
    longest_word_len = max(len(word) for word in words)

    return n_words, n_unique_words, average_word_len, longest_word_len


word_stat_columns = ['No Words', 'No Unique Words', 'Average Words Length', 'Longest Word Length']
df[word_stat_columns] = df.apply(
    lambda item: extract_other_specefications_of(item['normalized_corpus_words']),
    axis=1,
    result_type='expand',
)


# Extract Most Frequent Words of Corpus

In [74]:
def extract_n_most_frequent_word(words: list, n=20):
    word_freq = Counter(words)
    common_words_and_frequency = word_freq.most_common(n)

    return common_words_and_frequency


df['Common Words and Frequencies'] = df['normalized_corpus_words'].map(extract_n_most_frequent_word)

gb = df.groupby("context_header").apply(lambda x: list(chain.from_iterable(x['normalized_corpus_words'].tolist())))
gb

context_header
Billionaires      [private, jet, priceless, diamond, run, law, —...
Business          [record, £, 5.6, billion, 7.3, billion, spend,...
Leadership        [medium, technology, tool, fuel, information, ...
Lifestyle         [entertain, easter, come, april, 17, lot, pres...
Money             [real, earning, season, february, 14, –, march...
Real Estate       [cliffside, malibu, estate, belong, supermodel...
Small Business    [ceo, co, founder, inspirant, group, curator, ...
dtype: object
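
The per-category token lists in `gb` can be passed through the same `Counter` logic to obtain the most frequent words of each category. A minimal sketch on toy data, assuming a plain dictionary `tokens_by_category` (hypothetical, with invented tokens) in place of the grouped Series:

```python
from collections import Counter

# Hypothetical stand-in for `gb`: category -> flat list of normalized tokens.
tokens_by_category = {
    "Money": ["earning", "stock", "earning", "bank", "stock", "earning"],
    "Lifestyle": ["hotel", "travel", "hotel", "wine"],
}


def top_words(tokens, n=2):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokens).most_common(n)


top_by_category = {
    category: top_words(tokens) for category, tokens in tokens_by_category.items()
}
print(top_by_category["Money"])
```

With the real data, something like `gb.map(top_words)` should apply the same function to every category.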

# Aggregate Statistics About Articles

In [21]:
n = 10

df[['context_header',
    'corpus_title',
    'No Sentences',
    'Average Lengths Of Sentences',
    'No All Tokens',
    'No Punctuation Tokens',
    'No Stopwords',
    'No Words',
    'No Unique Words',
    'Average Words Length',
    'Longest Word Length',
    'Common Words and Frequencies'
    ]].sample(n)

| index | context_header | corpus_title | No Sentences | Average Lengths Of Sentences | No All Tokens | No Punctuation Tokens | No Stopwords | No Words | No Unique Words | Average Words Length | Longest Word Length | Common Words and Frequencies |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 121 | Small Business | Trebel Hopes Going Public Will Take It To The ... | 37.0 | 24.432432 | 904.0 | 120.0 | 392.0 | 397.0 | 241.0 | 5.846348 | 13.0 | [(music, 15), (mekikian, 13), (trebel, 11), (–... |
| 99 | Leadership | Lottery Numbers, Blockchain Articles And Cold ... | 62.0 | 23.967742 | 1486.0 | 188.0 | 598.0 | 663.0 | 407.0 | 5.974359 | 15.0 | [(russian, 14), (news, 10), (russia, 10), (rsf... |
| 130 | Billionaires | Mutual Fund Billionaire Edward “Ned” Johnson I... | 30.0 | 22.833333 | 685.0 | 78.0 | 252.0 | 343.0 | 225.0 | 5.822157 | 14.0 | [(johnson, 20), (fidelity, 10), (company, 9), ... |
| 96 | Money | Is Charles Schwab Stock Fairly Priced? | 23.0 | 26.73913 | 615.0 | 87.0 | 206.0 | 302.0 | 165.0 | 5.390728 | 14.0 | [(revenue, 13), (net, 11), (billion, 9), (y, 7... |
| 60 | Business | Aspirin Improves Survival For Hospitalized Cov... | 13.0 | 38.076923 | 495.0 | 56.0 | 183.0 | 250.0 | 158.0 | 6.004 | 13.0 | [(aspirin, 10), (patient, 10), (hospital, 7), ... |
| 105 | Lifestyle | Faith Connexion Sets The Tone For AW22 With Se... | 18.0 | 33.944444 | 611.0 | 79.0 | 219.0 | 295.0 | 217.0 | 6.027119 | 22.0 | [(faith, 11), (new, 9), (collection, 7), (conn... |
| 72 | Billionaires | Malaysian Billionaire Stanley Thai Invests $35... | 12.0 | 30.333333 | 364.0 | 34.0 | 139.0 | 170.0 | 125.0 | 6.105882 | 13.0 | [(supermax, 9), (u.s, 7), (glove, 5), (year, 5... |
| 39 | Lifestyle | Boeing 737 Crash: Why It Could Be Years Before... | 31.0 | 26.0 | 806.0 | 107.0 | 341.0 | 341.0 | 229.0 | 6.13783 | 14.0 | [(recorder, 6), (baier, 6), (aviation, 6), (ch... |
| 138 | Leadership | 4 Reasons Why SMS Is The Future Of E-Commerce | 81.0 | 15.888889 | 1287.0 | 187.0 | 514.0 | 567.0 | 312.0 | 5.798942 | 14.0 | [(marketing, 20), (sms, 14), (sm, 12), (email,... |
| 80 | Billionaires | PUBG Developer Teams Up With Solana Labs To La... | 12.0 | 34.25 | 411.0 | 48.0 | 136.0 | 206.0 | 149.0 | 6.228155 | 15.0 | [(blockchain, 10), (game, 8), (billionaire, 5)... |


# Extract Keywords Of Articles
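
For reference, the keyword scores computed in this section are TF-IDF weights. Assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), which the call in the cell below does not override, the weight of term $t$ in document $d$ should be as follows, where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1, \qquad
w(t,d) = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \left(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\right)^2}}
$$

Because `ngram_range=(1,3)` is used, a term can be a single token or a sequence of two or three consecutive tokens, which is why phrases such as `student loan` appear among the keywords. The five highest-weighted terms of each article are kept.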

In [22]:
# from rake_nltk import Rake
#
# r = Rake()
#
# r.extract_keywords_from_text(df['corpus_content'][0])
#
# r.get_ranked_phrases_with_scores()

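# the articles are already tokenized, so the tokenizer is the identity function
# and lowercasing is disabled; unigrams, bigrams and trigrams are scored
tfidf = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False, ngram_range=(1, 3))
r = tfidf.fit_transform(df['normalized_corpus_words'].tolist())
response = r.toarray()
# get_feature_names_out() (scikit-learn >= 1.0) returns the vocabulary as an array
feature_array = tfidf.get_feature_names_out()

tfidf_df = pd.DataFrame(response, columns=feature_array)
# keep the five highest-scoring terms of each article
df['Keywords'] = tfidf_df.apply(lambda row: row.nlargest(5).index.tolist(), axis='columns')
for keywords in df['Keywords'].tolist():
    print(keywords)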

['earning', 'gaap', 'gaap earning', 'million', 'analyst']
['fink', 'russia', 'war', 'globalization', 'world order']
['russian', 'sanction', 'putin', 'russia', 'treasury']
['dao', 'hoenisch', 'ethereum', 'hacker', 'attacker']
['loan', 'student loan', 'payment', 'refund', 'student']
['ftx', 'license', 'binance', 'crypto', '— —']
['tax', 'retirement', 'bill', 'r&d', 'restore']
['climate', 'sec', 'disclosure', 'information', 'investor']
['loan', 'student loan', 'student', 'borrower', 'payment']
['dei', 'bellan', 'bellan white', 'tech', 'journalism']
['leviev', 'diamond', 'de beer', 'beer', 'de']
['skill', 'shortage', 'industry', 'film', 'coach']
['team', 'lesson', 'culture', 'behavior', 'eternal']
['ham', 'biscuit', 'lamb', 'meat', 'mail order']
['malibu', 'beach', 'mansion', 'floor', 'crawford']
['lease', 'sentral', 'apartment', 'landing', 'residence']
['symbium', 'city', 'housing', 'adu', 'planning department']
['bathroom', 'renovation', 'diy', 'tile', 'budget']
['eqt', 'bpea', 'private 