# pyCCA

#### 210508

This notebook helps with doing Connected Concept Analysis in Python. It does (roughly) the same job as [Textometrica](http://textometrica.humlab.umu.se).

For a description of CCA, refer to this paper:

>Lindgren, S. (2016). ["Introducing Connected Concept Analysis: A network approach to big text datasets"](https://doi.org/10.1515/text-2016-0016). _Text & Talk: An Interdisciplinary Journal of Language, Discourse & Communication Studies_ 36(3): 341–362.

CCA is a workflow that combines manual thematic coding with a form of network text analysis ([NTA](http://www.casos.cs.cmu.edu/publications/protected/1995-1999/1995-1997/carley_1997_networktext.PDF)).

#### 1. Setup

Import the required libraries.

In [16]:
import pandas as pd
import numpy as np
import requests
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer

Import English stopword list.

In [12]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = [i for i in stopwords.words('english')]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Import Swedish stopword list.

In [22]:
response = requests.get('https://raw.githubusercontent.com/simonlindgren/pyCCA/master/swestops.txt')
swestops = response.text.splitlines()

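Choose the stopword list to use. English is the default; uncomment the first line below to switch to Swedish.

In [25]:
# stops = swestops # uncomment to use the Swedish list
# otherwise the English list loaded above is kept as `stops`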

Tweak stopword removal.

Set `keep_words` to a list of standard stopwords that should _not_ be removed.

Set `extra_stops` to a list of non-standard stopwords that should be removed.

In [26]:
# Set up stopwords
keep_words = [] # standard stopwords that should not be removed
stops = [i for i in stops if i not in keep_words] # drop the kept words from the standard list
extra_stops = [] # non-standard stopwords to be removed
stops.extend(extra_stops)
stops = frozenset(stops)
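To make the effect of the two lists concrete, here is a minimal sketch on a toy stopword list. The names `base_stops` and `toy_stops` and all the words in them are hypothetical stand-ins, not the NLTK or Swedish lists used above.

```python
# Hypothetical toy stopword list standing in for the real one
base_stops = ["the", "a", "not", "and", "of"]

keep_words = ["not"]         # a standard stopword we want to keep in the analysis
extra_stops = ["rt", "via"]  # corpus-specific noise words we want removed

# Same logic as the cell above: drop the kept words, then add the extra ones
toy_stops = [w for w in base_stops if w not in keep_words]
toy_stops.extend(extra_stops)
toy_stops = frozenset(toy_stops)

print(sorted(toy_stops))  # ['a', 'and', 'of', 'rt', 'the', 'via']
```

Here "not" survives stopword removal, which can matter when negations carry meaning in the corpus.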

Import the corpus from a file with one document per line.

In [29]:
with open("docs.txt", encoding="utf-8") as f: # assumes a UTF-8 file; adjust the encoding if needed
    corpus = [doc.strip() for doc in f]
len(corpus)

1645

#### 2. Vectorize and count

Vectorize the corpus while removing stopwords, keeping only tokens of at least two letters that contain no digits or special characters. Accents are stripped before tokenization (`strip_accents='unicode'`).

We use `ngram_range=(1, 2)` to get unigrams and bigrams.

In [31]:
%%time
# Vectorize and count
cv = CountVectorizer(ngram_range=(1, 2),
                     strip_accents='unicode',
                     stop_words=list(stops), # recent scikit-learn versions expect a list here, not a frozenset
                     token_pattern="[a-zA-Z][a-zA-Z]+") # at least two letters, and no numerical or special characters
dtm = cv.fit_transform(corpus)


CPU times: user 2.81 s, sys: 30.8 ms, total: 2.84 s
Wall time: 2.85 s
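As a rough illustration of what the vectorizer settings do to a single document, the sketch below imitates the tokenization in plain Python. The function `toy_analyze` is a hypothetical helper, not part of scikit-learn, and it leaves out accent stripping. It assumes that stopwords are removed before bigrams are formed, so a bigram can span a removed word.

```python
import re

# Same pattern as in the CountVectorizer call: two or more ASCII letters
TOKEN_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z]+")

def toy_analyze(text, stop_words, ngram_range=(1, 2)):
    """Lowercase, tokenize, drop stopwords, then build word n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower())
              if t not in stop_words]
    lo, hi = ngram_range
    ngrams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(toy_analyze("The 2 cats sat on a mat in 2021", {"the", "on", "in"}))
# ['cats', 'sat', 'mat', 'cats sat', 'sat mat']
```

The single-letter word "a" and the numbers are dropped by the token pattern, not by the stopword list.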


Get an ordered list of all token names.

In [32]:
wordlist = cv.get_feature_names_out() # scikit-learn >= 1.0; older versions use cv.get_feature_names()

Get document frequencies for tokens (i.e. how many documents they occur in, no matter how many times).

In [33]:
docfreqs = np.asarray((dtm != 0).sum(axis=0)).ravel().tolist() # for each column (i.e. each token), count the documents with a non-zero entry
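The difference between document frequency and total term frequency is easy to see on a small dense array. This is a sketch with made-up counts; `toy_dtm` is a hypothetical 3-document, 3-token matrix, not the notebook's `dtm`.

```python
import numpy as np

# Rows are documents, columns are tokens (made-up counts)
toy_dtm = np.array([[2, 0, 1],
                    [0, 0, 3],
                    [1, 1, 0]])

doc_freqs = (toy_dtm != 0).sum(axis=0)  # in how many documents each token occurs
term_freqs = toy_dtm.sum(axis=0)        # how many times each token occurs in total

print(doc_freqs.tolist())   # [2, 1, 2]
print(term_freqs.tolist())  # [3, 1, 4]
```

The third token occurs four times but in only two documents, so its document frequency is 2.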

Save token IDs (the column positions in `dtm`), token names, and document frequencies in a dataframe.

In [34]:
countsDF = pd.DataFrame(list(zip(wordlist, docfreqs))).reset_index() # the index becomes the token's column position in dtm
countsDF.columns = ["id", "token", "DF"]

Keep only the top 3000 tokens, by document frequency.

In [35]:
countsDF = countsDF.sort_values(by="DF", ascending=False).head(3000)

Write the top tokens, ordered by document frequency, to `words.txt` for manual review.

In [36]:
with open("words.txt", "w") as f:
    for term in countsDF.token:
        f.write(term + "\n")

#### 3. Word selection

<hr>
**Before continuing:**

- Open the `words.txt` file in an editor and delete all lines with words that you do _not_ want to keep for analysis. 
- Save the file with the same name, in the same directory.

<hr>

#### 4. Define concepts

Write a `concepts.csv` template in which every remaining word is initially its own concept.

In [37]:
with open("words.txt", "r") as f:
    analysis_words = [w.strip() for w in f]
df = pd.DataFrame(list(zip(analysis_words, analysis_words)), columns=['word','concept'])
df.to_csv("concepts.csv", index=None)

<hr>
**Before continuing:**

- Open the `concepts.csv` file in an editor and enter concept names in the second column. 
- Leave the header row (`word,concept`) as it is.
- If the word should not belong to a conceptual category, leave it as it is. 
- If you want to exclude the word from analysis, delete its entire row.
- Save the file with the same name in the same directory.

<hr>
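For orientation, this sketch shows what an edited `concepts.csv` might look like and how it parses into a word-to-concept mapping. The words and concept names are invented for illustration, and the file content is held in a string so that the example does not touch the real file.

```python
import csv
import io

# Hypothetical content of an edited concepts.csv
edited_csv = (
    "word,concept\n"
    "climate,ENVIRONMENT\n"
    "warming,ENVIRONMENT\n"
    "tax,tax\n"  # left unchanged: the word remains its own concept
)

reader = csv.DictReader(io.StringIO(edited_csv))
mapping = {row["word"]: row["concept"] for row in reader}

print(mapping)
# {'climate': 'ENVIRONMENT', 'warming': 'ENVIRONMENT', 'tax': 'tax'}
```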

In [38]:
with open("words.txt", "r") as f:
    tokensDF = pd.DataFrame([w.strip() for w in f], columns=['token'])
finalDF = pd.merge(tokensDF, countsDF) # inner join on 'token', to look up the id and DF of each kept word

In [39]:
tokenids_we_want = list(finalDF.id)

#### 5. Get co-occurrences

Get a smaller document-term matrix with only the tokens we want.

In [40]:
dtm2 = dtm[:, tokenids_we_want]
dtm2.shape

(1645, 43)

Compute the token co-occurrence matrix. It is turned into a `networkx` graph in the next section.

In [41]:
cooc_matrix = dtm2.T @ dtm2 # co-occurrence matrix in sparse format
cooc_matrix.setdiag(0) # set each token's co-occurrence with itself to 0
cooc_matrix = cooc_matrix.toarray() # convert sparse to a dense NumPy array
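The matrix product can be checked by hand on a toy example. The sketch below uses a hypothetical dense `toy_dtm` with made-up counts. It also shows a point worth being aware of: with raw counts, each cell is a sum of products of counts, whereas binarizing the matrix first gives the number of documents in which both tokens occur. The binarized variant is shown only for comparison and is not what the cell above computes.

```python
import numpy as np

# Rows are documents, columns are tokens (made-up counts)
toy_dtm = np.array([[2, 1, 0],
                    [1, 0, 1],
                    [1, 1, 0]])

# Same computation as above, on a dense array
raw_cooc = toy_dtm.T @ toy_dtm
np.fill_diagonal(raw_cooc, 0)

# Comparison: count each token at most once per document
binary = (toy_dtm > 0).astype(int)
doc_cooc = binary.T @ binary
np.fill_diagonal(doc_cooc, 0)

print(raw_cooc.tolist())  # [[0, 3, 1], [3, 0, 0], [1, 0, 0]]
print(doc_cooc.tolist())  # [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
```

Tokens 0 and 1 share two documents, but the raw product gives 3 because token 0 occurs twice in the first document.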

#### 6. Graph preparation

In [42]:
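G = nx.from_numpy_array(np.asarray(cooc_matrix)) # from_numpy_matrix was removed in NetworkX 3.0
print(nx.info(G)) # nx.info was also removed in NetworkX 3.0; use print(G) there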

Name: 
Type: Graph
Number of nodes: 43
Number of edges: 903
Average degree:  42.0000


<hr>
Keep only edges with a weight greater than the `cutoff`.

Re-run the cell below with different cutoffs to see the size of the resulting `G2` graph.

In [49]:
cutoff = 1
top = [edge for edge in G.edges(data=True)
       if edge[2]['weight'] > cutoff]
G2 = nx.Graph(top) # nodes with no remaining edges are not carried over
print(nx.info(G2)) # in NetworkX 3.0 and later, use print(G2)

Name: 
Type: Graph
Number of nodes: 43
Number of edges: 884
Average degree:  41.1163
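To get a feel for how the cutoff shapes the graph, the following sketch applies the same filter to a tiny hand-made edge list in plain Python. The list `toy_edges` and its weights are invented, and the dictionary `sizes` exists only to collect the results.

```python
# Hypothetical weighted edges: (node, node, weight)
toy_edges = [("a", "b", 5), ("a", "c", 1), ("b", "c", 2), ("c", "d", 1)]

sizes = {}
for cutoff in (0, 1, 2):
    kept = [(u, v, w) for u, v, w in toy_edges if w > cutoff]
    nodes = {n for u, v, _ in kept for n in (u, v)}
    sizes[cutoff] = (len(nodes), len(kept))
    print(f"cutoff={cutoff}: {len(nodes)} nodes, {len(kept)} edges")

# cutoff=0: 4 nodes, 4 edges
# cutoff=1: 3 nodes, 2 edges
# cutoff=2: 2 nodes, 1 edges
```

Raising the cutoff removes weak edges first, and nodes disappear once their last edge is gone.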


<hr>
Replace the numeric node labels with the tokens themselves.

In [50]:
labels = dict(enumerate(finalDF.token)) # node i corresponds to column i of dtm2, i.e. row i of finalDF
H = nx.relabel_nodes(G2, labels)

In [51]:
# Read concepts.csv into a dictionary that maps each word to its concept
import csv

with open("concepts.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader, None) # skip the header line in the file
    newlabels = {row[0]: row[1] for row in reader if row} # ignore empty lines

In [52]:
I = nx.relabel_nodes(H, newlabels)
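When several words share a concept, relabeling merges their nodes. As far as we understand `nx.relabel_nodes`, edges that collapse onto the same pair of nodes in a plain `Graph` are not summed, and only one of the weights is retained. If summed weights are wanted, a separate aggregation step is one option. The sketch below is an assumption-laden alternative in plain Python, not what the cell above does; `word_edges`, `word_to_concept`, and the concept name are invented.

```python
from collections import defaultdict

# Hypothetical word-level edges and word-to-concept mapping
word_edges = [("cat", "dog", 3), ("kitten", "dog", 2), ("cat", "kitten", 4)]
word_to_concept = {"cat": "FELINE", "kitten": "FELINE"}  # "dog" stays as it is

concept_weights = defaultdict(int)
for u, v, w in word_edges:
    cu = word_to_concept.get(u, u)
    cv = word_to_concept.get(v, v)
    if cu == cv:
        continue  # skip links between words of the same concept (self-loops)
    key = tuple(sorted((cu, cv)))  # undirected: store each pair in one order
    concept_weights[key] += w

print(dict(concept_weights))  # {('FELINE', 'dog'): 5}
```

Whether to sum weights, keep the maximum, or keep self-loops is a design choice that depends on the analysis.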

Save the word graph and the concept graph in GEXF format, which Gephi can open.

In [53]:
nx.write_gexf(H, "pyCCA_words.gexf")
nx.write_gexf(I, "pyCCA_concepts.gexf")