# pyCCA

This notebook supports Connected Concept Analysis (CCA) in Python. It does roughly the same job as [Textometrica](http://textometrica.humlab.umu.se).

For a description of CCA, refer to this paper:

>Lindgren, S. (2016). "Introducing Connected Concept Analysis: A network approach to big text datasets". _Text & Talk: An Interdisciplinary Journal of Language, Discourse & Communication Studies_ 36(3): 341–362. [doi](https://doi.org/10.1515/text-2016-0016)

CCA is a workflow for combining manual [thematic coding](http://dx.doi.org/10.1191/1478088706qp063oa) with a form of [network text analysis (NTA)](http://www.casos.cs.cmu.edu/publications/protected/1995-1999/1995-1997/carley_1997_networktext.PDF).

First, import the required libraries.

In [83]:
import re
import os
import numpy as np
import pandas as pd
import itertools
import networkx as nx
import plotly
import glob

from nltk.util import ngrams
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

Put a file named `docs.txt` in the working directory of this notebook. The file should have one document per line.

Import, tokenize and clean the file.

In [84]:
# Read one document per line; the context manager closes the file afterwards
with open("docs.txt", "r") as f:
    docs = [d.lower() for d in f]
docs = [[w.strip() for w in d.split()] for d in docs]

# Perform a series of cleaning operations
dks = []

for doc in docs:
    # Strip punctuation (raw string avoids invalid-escape warnings)
    tks = [re.sub(r"[.,:/\"?\-…'()!+]", "", t) for t in doc]
    tks = [t for t in tks if not t.startswith("http")]
    tks = [t for t in tks if not "http" in t]
    tks = [t for t in tks if not t.startswith("pictwitter")]
    tks = [t for t in tks if len(t) > 1 or t == "i"]
    tks = [t for t in tks if not t.startswith("#")]
    tks = [t for t in tks if not t.startswith("@")]
    tks = [t for t in tks if not t.isnumeric()]
    tks = [t for t in tks if len(t) > 2]
    tks = [t for t in tks if len(t) < 25]
    dks.append(tks)

docs = dks
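
To see what the cleaning filters do to a single document, here is a minimal sketch that applies the same rules to one made-up line. The helper name `clean` and the sample text are assumptions for illustration only; they are not part of the notebook.

```python
import re

# Same punctuation set as the cleaning cell, written as a character class
PUNCT = re.compile(r"[.,:/\"?\-…'()!+]")

def clean(line):
    """Hypothetical helper mirroring the filters in the cleaning cell."""
    tokens = [PUNCT.sub("", t) for t in line.lower().split()]
    return [
        t for t in tokens
        if "http" not in t                                   # drop links
        and not t.startswith(("pictwitter", "#", "@"))       # drop media links, hashtags, mentions
        and not t.isnumeric()                                # drop pure numbers
        and 2 < len(t) < 25                                  # keep tokens of 3 to 24 characters
    ]

sample = "I told my story at https://example.org/x #MeToo @someone in 2017, finally!"
print(clean(sample))  # ['told', 'story', 'finally']
```

Note that the length filter removes every token shorter than three characters, so the earlier exception for `"i"` has no effect on the final result.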

Extract and add ngrams (bigrams and trigrams).

In [85]:
docs2 = []

for doc in docs:
    # Start from the unigrams, then append bigrams and trigrams joined with "_"
    tokenlist = list(doc)
    tokenlist.extend("_".join(b) for b in ngrams(doc, 2))
    tokenlist.extend("_".join(t) for t in ngrams(doc, 3))
    docs2.append(tokenlist)
    
docs = docs2
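
As a quick illustration of what this step adds to a document, the sketch below builds the same kind of token list for a three-word toy document. It uses plain Python instead of `nltk` so that it runs on its own; `add_ngrams` is a hypothetical helper name.

```python
def add_ngrams(tokens):
    """Return unigrams plus underscore-joined bigrams and trigrams (hypothetical helper)."""
    out = list(tokens)
    for n in (2, 3):
        # Zipping n staggered views of the list yields consecutive n-token windows
        out.extend("_".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
    return out

example = add_ngrams(["told", "story", "finally"])
print(example)
# ['told', 'story', 'finally', 'told_story', 'story_finally', 'told_story_finally']
```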

Present the user with a list of top tokens (document frequency) in the corpus.

In [86]:
manual_stopwords = ["see", "get"] # ADD MANUAL STOPWORDS HERE!

# Keep the stop list as a set so membership tests stay fast
stop = set(stop) | set(manual_stopwords)


words = [w for doc in docs for w in set(doc) if w not in stop]
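
Document frequency counts each word at most once per document, which is why the comprehension iterates over `set(doc)`. A minimal sketch with made-up documents and a made-up stop list (all names here are assumptions) shows the effect:

```python
from collections import Counter

toy_docs = [
    ["women", "women", "stories"],   # "women" appears twice but counts once
    ["women", "men"],
    ["men", "stories", "stories"],
]
stop_toy = {"men"}  # hypothetical stop list

# Count the number of documents each non-stopword appears in
df_counts = Counter(w for doc in toy_docs for w in set(doc) if w not in stop_toy)
print(df_counts["women"], df_counts["stories"])  # 2 2
```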

In [88]:
### Added bit to process long word lists in RAM-friendly chunks

# Define chunking function to yield successive n-sized chunks from a list
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Use the function to divide the list into chunks of 50000 words each        
words_chunked = chunks(words, 50000)


# Iterate over chunks and create a sub-dataframe with word frequencies for each chunk
# Pickle each df to disk

# Create the output directory if it does not exist yet
os.makedirs("datapickles", exist_ok=True)

for c, chunk in enumerate(words_chunked):
    wordcounts = np.unique(chunk, return_counts=True)
    counted_words = wordcounts[0].tolist()
    word_counts = wordcounts[1].tolist()
    df = pd.DataFrame(list(zip(counted_words, word_counts)), columns=['word','DF']).sort_values(by="DF", ascending=False).reset_index(drop=True)   
    df.to_pickle("datapickles/dataframe" + str(c) + ".pkl")

In [89]:
# Get pickled frames
pickled_frames = glob.glob("datapickles/*.pkl")

# Create an empty dataframe
big_df = pd.DataFrame()

# Join all dataframes into the big_df
for pf in pickled_frames:
    df = pd.read_pickle(pf)
    # Keep words with a DF of at least 20 (applied per chunk, before the counts are summed)
    keepers = df['DF'] > 19
    df = df[keepers]
    big_df = pd.concat([big_df, df])

df = big_df.groupby(['word'])['DF'].sum().reset_index().sort_values(by="DF", ascending = False)
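
Because the frequency threshold is applied to each chunk before the chunks are summed, a word spread thinly across several chunks can be dropped or undercounted. The following sketch, using a toy word list and a hypothetical threshold `min_df`, compares filtering first with summing first. It is an illustration of the trade-off, not a replacement for the cells above.

```python
from collections import Counter

def chunks(lst, n):
    """Yield successive n-sized chunks from a list (same helper as above)."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

words_toy = ["women", "women", "women", "men", "men", "men", "women", "women"]
min_df = 3  # hypothetical threshold; the notebook keeps DF > 19

# Variant 1: filter inside each chunk, then sum (the order used in the notebook)
filtered_first = Counter()
for chunk in chunks(words_toy, 4):
    counts = Counter(chunk)
    filtered_first.update({w: c for w, c in counts.items() if c >= min_df})

# Variant 2: sum all chunks, then filter once
totals = Counter()
for chunk in chunks(words_toy, 4):
    totals.update(chunk)
summed_first = {w: c for w, c in totals.items() if c >= min_df}

print(dict(filtered_first))  # {'women': 3}
print(summed_first)          # {'women': 5, 'men': 3}
```

With large chunks and frequent words the difference is usually small, but it is worth keeping in mind when the threshold is close to the counts of interest.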

In [92]:
# Optional: delete all files in datapickles directory

#for f in glob.glob("datapickles/*"):
#    os.remove(f)

In [91]:
with open("words.txt", "w") as f:
    for w in df.word:
        f.write(w + "\n")

df.head(10)

| | word | DF |
|---|---|---|
| 119 | women | 235 |
| 56 | men | 196 |
| 79 | sexual | 186 |
| 48 | know | 123 |
| 55 | many | 107 |
| 91 | stories | 103 |
| 82 | sexually | 94 |
| 64 | one | 92 |
| 9 | assault | 90 |
| 66 | people | 90 |


<hr>
**Before continuing:**

- Open the `words.txt` file in an editor and delete all lines with words that you do _not_ want to keep for analysis. 
- Save the file with the same name, in the same directory.

<hr>

**Now, continue** and prepare for conceptualisation.

In [65]:
with open("words.txt", "r") as f:
    analysis_words = [w.strip() for w in f]

# Each word starts out as its own concept; the second column is edited by hand afterwards
df = pd.DataFrame(list(zip(analysis_words, analysis_words)), columns=['word','concept'])
df.to_csv("concepts.csv", index=False)

<hr>
**Before continuing:**

- Open the `concepts.csv` file in an editor and enter concept names in the second column. 
- Leave the header row (`word,concept`) as it is.
- If the word should not belong to a conceptual category, leave it as it is. 
- If you want to exclude the word from analysis, delete its entire row.
- Save the file with the same name in the same directory.

<hr>
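
As an illustration of the editing step above, the sketch below parses a made-up `concepts.csv` in which several words have been assigned to a shared concept. The rows and concept names are hypothetical and only show the expected shape of the file.

```python
import csv
import io

# Hypothetical contents of concepts.csv after manual editing
edited = """word,concept
women,women
woman,women
assault,violence
sexual_assault,violence
stories,stories
"""

# Map every word to the concept it was assigned to
mapping = {row["word"]: row["concept"] for row in csv.DictReader(io.StringIO(edited))}
print(sorted(set(mapping.values())))  # ['stories', 'violence', 'women']
```

Here five words collapse into three concepts; `stories` was left unchanged and therefore remains its own concept.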

**Now, continue** and prepare for network analysis.

First, get all co-occurrence pairs by going through each `doc`.

In [6]:
# DataFrame.from_csv was removed from pandas; read_csv is the supported replacement
df = pd.read_csv("concepts.csv")

co_occurrences = []

for count,doc in enumerate(docs):
    print("\rProcessing document " + str(count+1) + "/" + str(len(docs)), end="")
    matches = []
    for word,concept in zip(df.word.tolist(),df.concept.tolist()):
        if word in doc:
            matches.append(concept)
    if (len(set(matches))) > 1:
        co_occurrences.append(list(set(matches)))

Processing document 4000/4000
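
The matching logic can be sketched on toy data as follows. The documents and the word-to-concept mapping are made up; converting each document to a set is an optional speed-up assumed here, and the results are sorted only to make the output deterministic.

```python
toy_docs = [
    ["woman", "assault", "woman_assault"],
    ["stories"],
    ["women", "stories"],
]
# Hypothetical word -> concept mapping, as it would come from concepts.csv
mapping = {"women": "women", "woman": "women", "assault": "violence", "stories": "stories"}

co = []
for doc in toy_docs:
    tokens = set(doc)  # set membership is faster than scanning a list
    concepts = {concept for word, concept in mapping.items() if word in tokens}
    # Only documents that contain at least two distinct concepts produce a co-occurrence
    if len(concepts) > 1:
        co.append(sorted(concepts))

print(co)  # [['violence', 'women'], ['stories', 'women']]
```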

How many of the documents included co-occurrences?

In [7]:
print(str(len(co_occurrences)) + " out of " + str(len(docs)))

2247 out of 4000


Some examples of co-occurrences.

In [8]:
for cooc in co_occurrences[:7]:
    print(cooc)

['join', 'like', 'twitter', 'happening', 'theres', 'with_you', 'thats']
['wearing', 'started', 'going', 'every', 'day', 'getting', 'think', 'every_day']
['read', 'you_can', 'enough', 'start', 'safe', 'feel', 'brave', 'hope', 'help']
['someone', 'really', 'needs']
['husband', 'better', 'dear', 'future']
['next', 'and_then', 'for_the']
['twitter', 'good', 'later']


As seen above, our co-occurrences are lists of matches. But for network visualisation, we want pairs.

Therefore, we go through all `co_occurrences` and use the `itertools` library to generate every pairwise combination within each co-occurrence list.

In [9]:
sources = []
targets = []

for count,c in enumerate(co_occurrences):
    print("\rProcessing co-occurrence " + str(count+1) + "/" + str(len(co_occurrences)), end="")
    pairs = itertools.combinations(c,2)
    for p in list(pairs):
        sources.append(p[0])
        targets.append(p[1])

Processing co-occurrence 2247/2247
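
For reference, `itertools.combinations(c, 2)` yields every unordered pair exactly once, so a list of k concepts produces k(k-1)/2 edges. A short sketch using the first three concepts from the example output above:

```python
from itertools import combinations

concepts = ["join", "like", "twitter"]
pairs = list(combinations(concepts, 2))
print(pairs)  # [('join', 'like'), ('join', 'twitter'), ('like', 'twitter')]

# A co-occurrence list with 7 concepts yields 7 * 6 / 2 = 21 pairs
print(len(list(combinations(range(7), 2))))  # 21
```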

Some example edges.

In [10]:
shortlist = list(zip(sources,targets))

for source,target in shortlist[:7]:
    print(source + ";" + target)

join;like
join;twitter
join;happening
join;theres
join;with_you
join;thats
like;twitter


We write the full list of such pairs to a file named `cca.csv` that can be opened in Gephi as an edgelist.

In [11]:
edges_df = pd.DataFrame(list(zip(sources, targets)), columns=['source','target'])
edges_df.to_csv("cca.csv", index=False)

In [12]:
from networkx import set_node_attributes,betweenness_centrality
from pyd3netviz import ForceChart
import numpy as np

In [13]:
# First create a multigraph to allow multiple edges between the same pairs of nodes
edgelist = pd.read_csv('cca.csv')
# from_pandas_dataframe was removed in networkx 2.x; from_pandas_edgelist replaces it
M = nx.from_pandas_edgelist(edgelist, 'source', 'target', create_using=nx.MultiGraph())

In [14]:
# Create weighted graph from M
G = nx.Graph()
for u,v,data in M.edges(data=True):
    w = data['weight'] if 'weight' in data else 1.0
    if G.has_edge(u,v):
        G[u][v]['weight'] += w
    else:
        G.add_edge(u, v, weight=w)
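
The effect of collapsing parallel edges into weights can be checked without `networkx`. In this sketch, with made-up edges, each pair is sorted so that `(a, b)` and `(b, a)` count as the same undirected edge, and the number of repetitions becomes the weight.

```python
from collections import Counter

# Hypothetical edge list with one repeated pair in reversed order
edges = [("women", "violence"), ("violence", "women"), ("women", "stories")]

# Sort each pair so direction does not matter, then count repetitions
weights = Counter(tuple(sorted(e)) for e in edges)
print(weights[("violence", "women")], weights[("stories", "women")])  # 2 1
```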

In [13]:
G

<networkx.classes.graph.Graph at 0x114615f98>

In [15]:
fc = ForceChart(G, charge=-100, link_distance=50, link_color_field='link_color',
                node_color_field='node_color', node_radius_field='node_size')
fc.to_notebook()