
Step 1. Load data into a pandas DataFrame: df



In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/ASPI Strategist/ASPI_Strategist_2021-09-11_09-12_AEST.csv', index_col = 0)

print(df.head())
print(df.info())

                                               title  \
0  Australia needs to take India more seriously a...   
1   India’s strategic options for dealing with China   
2  Figures reveal slump in Australia’s trade with...   
3      Australian bananas: the view from New Zealand   
4   Breaking the Australia–China media feedback loop   

                        date              authors  \
0  2020-07-21 11:20:47+10:00  ['Jagannath Panda']   
1  2020-07-10 12:03:15+10:00   ['Shashi Tharoor']   
2  2020-08-10 15:00:51+10:00       ['David Uren']   
3  2018-08-29 14:30:14+10:00      ['Colin James']   
4  2018-05-24 11:30:12+10:00      ['Fergus Ryan']   

                                                tags  \
0  ['Australia', 'India', 'Indo-Pacific', 'defenc...   
1                  ['China', 'India', 'geopolitics']   
2       ['Australia', 'India', 'economics', 'trade']   
3  ['Jacinda Ardern', 'New Zealand', 'democracy',...   
4             ['Australia', 'China', 'social media']   
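The preview above shows 'authors' and 'tags' as bracketed strings and 'date' as text, which is how list and timestamp values come back from a CSV file. A minimal sketch, assuming a small stand-in frame named toy (hypothetical, not the real dataset), of how those columns could be converted into Python lists and timezone-aware timestamps:

```python
import ast

import pandas as pd

# toy stand-in for two rows of the loaded data (hypothetical values)
toy = pd.DataFrame({
    'authors': ["['Jagannath Panda']", "['A', 'B']"],
    'date': ['2020-07-21 11:20:47+10:00', '2018-08-29 14:30:14+10:00'],
})

# turn the string "['A', 'B']" into the list ['A', 'B']
toy['authors'] = toy['authors'].apply(ast.literal_eval)

# parse the timestamps; utc=True gives one consistent timezone-aware dtype
toy['date'] = pd.to_datetime(toy['date'], utc=True)

print(toy.dtypes)
print(toy['authors'][1])
```

Whether this conversion is needed depends on how the columns are used later; the rest of this notebook only relies on 'title' and 'text'.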

       

Step 2. Data cleansing

Step 2a. Create basic text features - 'num_chars', 'num_words'

In [None]:
# cast 'text' to str: pandas reads empty cells as NaN (a float), which is the likely
# reason some values were not detected as text; note that NaN becomes the string 'nan'
df['text'] = df['text'].astype(str)

# create a 'num_chars' feature
df['num_chars'] = df['text'].apply(len)

# define custom function that returns number of words in a string: word_count 
def word_count(string):
    # split the string into words 
    words = string.split() 
    # return length of words list 
    return len(words)

# create a 'num_words' feature
df['num_words'] = df['text'].apply(word_count)

df.describe()

          num_chars    num_words
count   4850.000000  4850.000000
mean    5246.500000   821.790103
std     1872.832181   296.027081
min        2.000000     0.000000
25%     4717.000000   738.000000
50%     5389.000000   842.000000
75%     6044.500000   947.000000
max    30291.000000  4976.000000
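The two features above can also be computed with pandas string methods instead of apply. A minimal sketch on a toy frame (the name toy and its contents are assumptions for illustration only):

```python
import pandas as pd

# toy stand-in for the 'text' column (hypothetical data)
toy = pd.DataFrame({'text': ['Three weeks after the election', '', 'one  two\nthree']})

# character count per row
toy['num_chars'] = toy['text'].str.len()

# word count per row: split on any run of whitespace, then count the pieces
toy['num_words'] = toy['text'].str.split().str.len()

print(toy[['num_chars', 'num_words']])
```

Both versions split on whitespace, so they should give the same counts as word_count; the empty string yields 0 words, matching the minimum of 0 in the summary above.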


Step 2b. Remove rows with fewer than 500 words from the dataframe

In [None]:
df = df[df['num_words'] >= 500]
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4459 entries, 0 to 4849
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      4459 non-null   object
 1   date       4459 non-null   object
 2   authors    4459 non-null   object
 3   tags       4459 non-null   object
 4   text       4459 non-null   object
 5   url        4459 non-null   object
 6   num_chars  4459 non-null   int64 
 7   num_words  4459 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 313.5+ KB
None
          num_chars    num_words
count   4459.000000  4459.000000
mean    5636.832698   882.775959
std     1343.556508   214.883341
min     2964.000000   500.000000
25%     4897.000000   766.000000
50%     5479.000000   857.000000
75%     6116.500000   957.000000
max    30291.000000  4976.000000
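Filtering with a boolean mask keeps the original index labels, which is why the index above still runs from 0 to 4849 with only 4459 entries. A minimal sketch on toy data (the names toy, min_words and filtered are assumptions), which also takes an explicit copy so that later column assignments act on an independent frame:

```python
import pandas as pd

# toy stand-in for the dataframe (hypothetical data)
toy = pd.DataFrame({'title': ['a', 'b', 'c'], 'num_words': [120, 500, 842]})
min_words = 500  # same threshold as in Step 2b

# .copy() returns an independent DataFrame; in pandas versions that emit
# SettingWithCopyWarning this avoids the warning on later assignments
filtered = toy[toy['num_words'] >= min_words].copy()

# the surviving rows keep their original labels (1 and 2), not 0 and 1
print(filtered.index.tolist())
```

Because labels are preserved, a lookup such as df['text'][100] refers to the row labelled 100, not to the 101st remaining row.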


In [None]:
df['text'][100] # sample text from dataframe before pre-processing

'Three weeks after Americans went to the polls, the morass of conspiracy theories and disinformation surrounding the election and its results continues to grow. Although the US is half a world away, Australians don’t have the luxury of watching this maelstrom as uninterested observers. The conspiracy information ecosystem is highly international, and here in Australia conspiracy groups are often dominated by narratives and content emerging from the US. As the conspiratorial tidal wave swamps America, ripples are already reaching Australia—and are likely to have implications for our own elections in 2022. Since around mid-March, Australians have witnessed incredible growth in the spread of conspiracy theories. While this content has spread largely online, conspiracy-fuelled anti-lockdown protests and arrests around the country, particularly in Melbourne, have demonstrated its ability to translate into unrest and conflict in the offline world. Many of these conspiracy theories originate 

Step 3. Pre-process text

In [None]:
import re

# define custom function to pre-process text: pre_process
def pre_process(text):
    # convert to lowercase
    text = text.lower()
    # remove digits (done before collapsing spaces, so no double spaces are left behind)
    text = re.sub(r'[0-9]+', '', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # remove leading/trailing spaces
    text = text.strip()
    return text

df['text'] = df['text'].apply(pre_process)
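The same cleaning steps can be expressed with pandas string methods. A minimal sketch on a one-row toy Series (toy and cleaned are hypothetical names); in this sketch digits are removed before whitespace is collapsed, which is a choice of the sketch so that removing a number does not leave two spaces in a row:

```python
import pandas as pd

# toy input with mixed case, digits, repeated spaces and a newline (hypothetical)
toy = pd.Series(['  Elections in 2022   were\nheld. '])

cleaned = (
    toy.str.lower()                             # convert to lowercase
       .str.replace(r'[0-9]+', '', regex=True)  # remove digits
       .str.replace(r'\s+', ' ', regex=True)    # collapse runs of whitespace
       .str.strip()                             # remove leading/trailing spaces
)

print(cleaned.tolist())
```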

In [None]:
df['text'][100] # sample text from dataframe after pre-processing

'three weeks after americans went to the polls, the morass of conspiracy theories and disinformation surrounding the election and its results continues to grow. although the us is half a world away, australians don’t have the luxury of watching this maelstrom as uninterested observers. the conspiracy information ecosystem is highly international, and here in australia conspiracy groups are often dominated by narratives and content emerging from the us. as the conspiratorial tidal wave swamps america, ripples are already reaching australia—and are likely to have implications for our own elections in . since around mid-march, australians have witnessed incredible growth in the spread of conspiracy theories. while this content has spread largely online, conspiracy-fuelled anti-lockdown protests and arrests around the country, particularly in melbourne, have demonstrated its ability to translate into unrest and conflict in the offline world. many of these conspiracy theories originate in t

Step 4. Create a list of documents from the 'text' column: docs

In [None]:
docs = df['text'].tolist()

Step 5. Generate the tf-idf matrix

In [None]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(docs)

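# print the matrix shape (read directly from the sparse matrix; no dense copy needed)
print(csr_mat.shape)

# get the words: words
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
words = tfidf.get_feature_names_out()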

# print words
print(words)

(4459, 46525)


In [None]:
type(csr_mat)

scipy.sparse.csr.csr_matrix

In [None]:
# perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components = 50)

# create a KMeans instance: kmeans
kmeans = KMeans(n_clusters = 50)

# create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

In [None]:
# Fit the pipeline to articles
pipeline.fit(csr_mat)

# Calculate the cluster labels: labels
labels = pipeline.predict(csr_mat)

# Create a DataFrame aligning labels and titles: df1
df1 = pd.DataFrame({'label': labels, 'article': df['title']})

# Display df1 sorted by cluster label
print(df1.sort_values('label'))

      label                                            article
3877      0                 The Strategist Six: Danilo Pamonag
1049      0   Australia and Indonesia: an enduring partnership
1052      0        Australia, the US and the Indo-Asia-Pacific
3538      0  Better civic education will help Australians r...
2173      0                   Marise Payne: lessons from D-Day
...     ...                                                ...
4650     49                   Keeping the Balkan ghosts at bay
3130     49  Macron is Biden’s best bet for a European partner
4651     49                     How Europe can live with China
1835     49                          Europe’s complacency trap
1130     49               Will Italy’s populists upend Europe?

[4459 rows x 2 columns]
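KMeans starts from random centres, so cluster numbers and memberships can change between runs of the cells above. A minimal sketch on a six-document toy corpus (all names prefixed with toy_ are hypothetical) that fixes random_state on both steps so the labels are repeatable, and uses fit_predict to fit and label in one call:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# toy corpus with two rough themes, trade and defence (hypothetical data)
toy_docs = [
    'china trade tariffs exports',
    'china exports trade deal',
    'trade tariffs china economy',
    'navy submarines defence fleet',
    'defence fleet navy budget',
    'submarines navy defence force',
]
toy_mat = TfidfVectorizer().fit_transform(toy_docs)

# reduce dimensionality, then cluster; random_state makes both steps repeatable
toy_pipeline = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# fit the pipeline and return one cluster label per document
toy_labels = toy_pipeline.fit_predict(toy_mat)
print(toy_labels)
```

The values 2 and 2 are sized for the toy corpus only; the notebook itself uses 50 components and 50 clusters.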


In [None]:
df1.to_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/test.csv')

Sandpit - NMF

In [None]:
# import NMF
from sklearn.decomposition import NMF

# create an NMF model: model
model = NMF(n_components=20)

model.fit(csr_mat)

nmf_features = model.transform(csr_mat)

array([[7.68733189e-03, 1.40906131e-07, 1.42797544e-02, ...,
        2.27763153e-02, 2.83481133e-02, 4.39485840e-02],
       [1.82119221e-02, 0.00000000e+00, 2.07325756e-04, ...,
        2.56603404e-02, 0.00000000e+00, 3.24841362e-03],
       [4.25077069e-02, 0.00000000e+00, 1.40815437e-03, ...,
        1.92138170e-02, 4.09670557e-03, 0.00000000e+00],
       ...,
       [1.86528519e-02, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 5.58916099e-03],
       [3.29342592e-02, 2.10552277e-08, 4.13910697e-04, ...,
        1.02844541e-02, 0.00000000e+00, 2.60108446e-02],
       [4.44212125e-02, 5.49362735e-07, 1.36397077e-03, ...,
        2.47571658e-02, 0.00000000e+00, 7.40926618e-02]])

In [1]:
# needs the NMF cell above to have run in the same session, otherwise 'model' is undefined
print(model.components_.shape)

NameError: ignored