We first load the raw data: a dictionary of dictionaries containing all the abstracts. Note that the encoding has to be specified explicitly when opening the file.

In [6]:
import json

fn = 'C:/iPython Notebook/Data/JSON/full_nano_JSON.txt'
with open(fn, encoding = 'UTF-8') as fh:
    corpus = json.load(fh)


# Listing content


Quite often, the first thing to do after collecting the data is to make a few lists to get a feel for it. What are the most frequently occurring journals in the dataset? Who are the most frequently occurring authors?

This is a straightforward operation in Python because the standard library provides a counter out of the box.

In [7]:
import collections
counter = collections.Counter()

for record in corpus.values():
     counter[record['J9']] += 1

A Counter in Python is a subclass of the built-in dictionary that comes with a couple of additional useful features. The first is the most_common method, which returns a list of (item, frequency) tuples sorted from most to least frequent. We can limit the number of items returned with the 'n' keyword argument.

In [8]:
counter.most_common(n=20)

[('', 1481),
 ('ABSTR PAP AM CHEM S', 414),
 ('P SOC PHOTO-OPT INS', 374),
 ('ACS NANO', 351),
 ('PROC SPIE', 346),
 ('ANGEW CHEM INT EDIT', 344),
 ('J NANOPART RES', 329),
 ('J NANOSCI NANOTECHNO', 311),
 ('NANOTECHNOLOGY', 304),
 ('INT J NANOMED', 232),
 ('APPL PHYS LETT', 230),
 ('NANO LETT', 222),
 ('MATER TODAY', 217),
 ('NANOMEDICINE-UK', 196),
 ('J AM CHEM SOC', 183),
 ('IEEE T NANOTECHNOL', 177),
 ('LANGMUIR', 171),
 ('NANOSCALE', 169),
 ('SMALL', 166),
 ('CHEM ENG NEWS', 164)]

Counters can also be added to or subtracted from one another. So imagine you have two counters, a and b.

We can then write

```Python

c = a - b

```

Here we subtract counter b from counter a. Implicitly, Python loops over the items in both counters and subtracts the counts in b from those in a. An item that is missing from a or b is treated as having a count of zero, and items whose resulting count is zero or negative are dropped from the result.
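As a small runnable illustration of counter arithmetic, the sketch below uses two made-up counters (the counts are invented and only the journal names are borrowed from the list above). It shows subtraction, addition, and the `&` intersection that is used later in this notebook.

```python
from collections import Counter

# Two made-up counters; the counts are purely illustrative.
a = Counter({'ACS NANO': 5, 'SMALL': 2, 'LANGMUIR': 1})
b = Counter({'ACS NANO': 3, 'SMALL': 2, 'NANOSCALE': 4})

difference = a - b   # counts subtracted; zero and negative results are dropped
total = a + b        # counts added item by item
overlap = a & b      # minimum of the two counts, for items present in both

print(difference)
print(total)
print(overlap)
```

Note that 'SMALL' and 'NANOSCALE' disappear from the difference, because their counts would be zero and negative respectively.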

## authors

Let's move to a slightly more complicated example. Imagine we would like to count the frequency of authors in the dataset. The author information in ISI is stored under the 'AU' key

In [48]:
# <-- would be nice to have a standard naming convention

authors = corpus.get(list(corpus.keys())[0])['AU']
print(authors)

Williams, RM; Shah, J; Ng, BD; Minton, DR; Gudas, LJ; Park, CY; Heller, DA


As an example, the authors of the first item in the corpus are shown above. If we look at this carefully, we see that all the authors are stored in a single string, with different authors separated by a semicolon. So, if we want to count the frequency of authors in the dataset, we need to do some basic string handling. We need to split the string on the semicolon like this

In [18]:
list_of_authors = authors.split(';')
list_of_authors

['Wang, BB', ' Ostrikov, K']

This split operation returns a list with one or more items. As you can see, there can be some white space before an author name, and upper and lower case are not always used consistently. So let us clean up the strings a bit more. We are going to trim off the white space and force everything to lower case.

In [19]:
list_of_authors = [author.strip() for author in list_of_authors]
list_of_authors = [author.lower() for author in list_of_authors]
list_of_authors

['wang, bb', 'ostrikov, k']

We now have consistent-looking representations of the author names. This list can be used in combination with a Counter, as we did when counting journals.

In [20]:
counter = collections.Counter()

for record in corpus.values():
    authors = record['AU']
    list_of_authors = authors.split(';')
    list_of_authors = [author.strip() for author in list_of_authors]
    list_of_authors = [author.lower() for author in list_of_authors]    
    
    for author in list_of_authors:
         counter[author] += 1

counter.most_common(n=20)

[('[anonymous]', 569),
 ('webster, tj', 107),
 ('liu, y', 99),
 ('wang, j', 81),
 ('wang, y', 74),
 ('seeman, nc', 69),
 ('feng, ss', 68),
 ('li, j', 64),
 ('zhang, y', 61),
 ('li, y', 57),
 ('guo, px', 55),
 ('ferrari, m', 55),
 ('roco, mc', 53),
 ('yan, h', 52),
 ('liu, j', 50),
 ('kim, j', 47),
 ('kumar, s', 46),
 ('chen, y', 45),
 ('lee, j', 44),
 ('ostrikov, k', 42)]

We now have a nice overview of the frequency of authors in our dataset. It is not uncommon for the same author to appear under slightly different names in the dataset. Sometimes an author's first name is written out, in other cases only initials are used. At the end of this notebook, we return to this problem and demonstrate a way of handling it.
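The split, strip, and lower-case steps used for the author field recur several times in this notebook, so they could be wrapped in a small helper. The function name `split_authors` is hypothetical and not part of the original code. Dropping empty names is an added assumption that guards against records with an empty 'AU' field.

```python
def split_authors(author_field):
    """Split a semicolon-separated author string into clean, lower-case names."""
    names = [name.strip().lower() for name in author_field.split(';')]
    # drop empty entries, e.g. from an empty field or a trailing semicolon
    return [name for name in names if name]


print(split_authors('Wang, BB; Ostrikov, K'))
```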

## countries

Let's move to a more complicated example where we have to go beyond basic string handling. Let's say that we would like to know the frequency of the countries in the dataset. How many articles are being published by a given country? This is a complicated issue for several reasons:
* we need to determine the country for each author on a given paper
* we need to count each country only once for each paper
* how do we handle papers written by people from different countries?

We are now going to address these issues step by step.

First, let's start with the country information. ISI provides detailed affiliation information for each author via the C1 tag.




In [22]:
# <-- some python 2X in here
# <-- you wont get the same article twice
affil = corpus.get(list(corpus.keys())[2])['C1']
print(affil)

Univ Virginia, Sch Engn & Appl Sci, Div Technol Culture & Commun, Charlottesville, VA 22904 USA


Like with the author example, semicolons are being used as a separator. So a naive approach would be to split on the semicolon.

In [24]:
affil.split(';')

['Univ Virginia, Sch Engn & Appl Sci, Div Technol Culture & Commun, Charlottesville, VA 22904 USA']

We do get a nice looking list, but if you look a bit more carefully you see that it is rather messy. For records that list several authors, we get a mix of entries: some contain only author names, while others contain both an author name and an affiliation. We could do some more string handling to resolve this, but in this case a regular expression is much more elegant.

If we look carefully at the affiliation field, we see that it follows a clear pattern. There is a list of authors between square brackets, followed by the address information. Successive instances of this pattern are separated by semicolons. If we use a regular expression that matches this pattern, we can split on that instead. So, we want to match the brackets and everything in between.

In [25]:
import re

regexp = re.compile(r'\[(.*?)\]')
elements = re.split(regexp, affil)
for element in elements:
    print(element)

Univ Virginia, Sch Engn & Appl Sci, Div Technol Culture & Commun, Charlottesville, VA 22904 USA


This regular expression produces nice clean results. Because the pattern contains a capturing group, the split also returns the captured text, so we get a list of items that alternates between the list of authors that was between the square brackets and the affiliation information associated with this set of authors. Also, the first item is an empty string, so let's first remove the empty string, and then reshape the list such that authors and affiliations are explicitly connected.

In [26]:
elements = [entry for entry in elements if entry]

author_affil = zip(elements[0::2], elements[1::2])
author_affil

<zip at 0x1cbb94c8>

We remove the empty string using a filter condition inside the list comprehension. An empty string always evaluates to False in Python and so gets dropped. Next, we pair up each list of authors with its affiliation address, giving tuples where the first item is the list of authors and the second item is their address. We use the zip function for this, in combination with slice notation. The slice starts at either 0, for the authors, or 1, for their address, and runs to the end of the list with a step size of 2. Note that in Python 3, zip returns a lazy iterator rather than a list, which is why the output above shows a zip object; it can be looped over only once.
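The record printed earlier has no bracketed author list, so the effect of the split is hard to see there. The sketch below runs the same steps on a made-up 'C1' string that follows the bracketed pattern described above. The sample string is an assumption assembled for illustration, not a record from the corpus.

```python
import re

affil_pattern = re.compile(r'\[(.*?)\]')

# Made-up affiliation field in the "[authors] address; [authors] address" format.
sample = ('[Kotyla, Tim; Kuo, Fonghsu] Univ Massachusetts, Ctr Hlth & Dis Res, '
          'Lowell, MA 01854 USA; [Yoganathan, Subbiah] Forsyth Inst, Boston, MA USA')

# The capturing group makes split() return the bracket contents as well.
elements = affil_pattern.split(sample)
elements = [entry for entry in elements if entry]   # drop the leading empty string

# Even positions hold author lists, odd positions hold the matching address.
pairs = list(zip(elements[0::2], elements[1::2]))
for authors, address in pairs:
    print(repr(authors), '->', repr(address))
```

Wrapping zip in list() makes the pairs reusable, which is convenient when inspecting them more than once.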

The final step is to take this and reformat it such that we have a mapping between an author and an affiliation. To do this, we iterate over the pairs, split the first entry on the semicolon, and match each author to the associated affiliation.

In [12]:
author_affiliation_map = collections.defaultdict(list)

for authors, affiliation in author_affil:
    list_of_authors = authors.split(';')
    list_of_authors = [author.strip() for author in list_of_authors]
    list_of_authors = [author.lower() for author in list_of_authors]
    
    for author in list_of_authors:
        author_affiliation_map[author].append(affiliation)

for key, value in author_affiliation_map.items():
    print(key, value)

subramanian, balajikarthick [u' Univ Massachusetts, Ctr Hlth & Dis Res, Lowell, MA 01854 USA; ', u' Univ Massachusetts, Biomed Engn & Biotechnol Program, Lowell, MA 01854 USA; ']
yoganathan, subbiah [u' Forsyth Inst, Boston, MA USA']
wilson, thomas [u' Univ Massachusetts, Ctr Hlth & Dis Res, Lowell, MA 01854 USA; ']
kotyla, tim [u' Univ Massachusetts, Ctr Hlth & Dis Res, Lowell, MA 01854 USA; ']
kuo, fonghsu [u' Univ Massachusetts, Ctr Hlth & Dis Res, Lowell, MA 01854 USA; ', u' Univ Massachusetts, Biomed Engn & Biotechnol Program, Lowell, MA 01854 USA; ']
nicolosi, robert [u' Univ Massachusetts, Ctr Hlth & Dis Res, Lowell, MA 01854 USA; ', u' Univ Massachusetts, Biomed Engn & Biotechnol Program, Lowell, MA 01854 USA; ']
ada, earl [u' Univ Massachusetts, Mat Characterizat Lab, Lowell, MA 01854 USA; ']


Note that it might be possible for a given author to have more than one affiliation. We therefore are using a defaultdict with lists and append to this list. 
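To see why a defaultdict with lists is convenient here, consider the minimal sketch below. The (author, affiliation) pairs are made up for illustration. With a plain dict, the first append for each author would raise a KeyError unless the list were created first.

```python
from collections import defaultdict

# Made-up (author, affiliation) pairs; one author appears twice.
pairs = [
    ('kuo, fonghsu', 'Ctr Hlth & Dis Res'),
    ('kuo, fonghsu', 'Biomed Engn & Biotechnol Program'),
    ('ada, earl', 'Mat Characterizat Lab'),
]

author_affiliation_map = defaultdict(list)
for author, affiliation in pairs:
    # a missing key is created on first access with an empty list
    author_affiliation_map[author].append(affiliation)

print(dict(author_affiliation_map))
```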

Now that we have a nice mapping of authors to affiliations, let's turn to the next problem: extracting the country name from the affiliation. First, let's look at some affiliations


In [28]:
for i in range(4):
    affil = corpus.get(list(corpus.keys())[i])['C1']
    elements = re.split(regexp, affil)
    elements = [entry for entry in elements if entry]
    for entry in elements[1::2]:
        print(entry)


 Mem Sloan Kettering Canc Ctr, New York, NY 10065 USA; 
 Weill Cornell Grad Sch Med Sci, New York, NY 10065 USA; 
 Weill Cornell Med Coll, Dept Pharmacol, New York, NY 10065 USA; 
 Weill Cornell Med Coll, Dept Cell & Dev Biol, New York, NY 10065 USA
 Kanazawa Univ, Grad Sch Nat Sci & Technol, Div Math & Phys Sci, Kanazawa, Ishikawa 9201192, Japan; 
 Natl Inst Adv Ind Sci & Technol, Res Inst Computat Sci, Tsukuba, Ibaraki 3058568, Japan


As we can see, the country is at the end of the string, but sometimes there is some non-alphanumeric stuff there as well. So the first step is to remove that. Next, we can split it on a comma and only keep the last element

In [29]:
for i in range(4):
    affil = corpus.get(list(corpus.keys())[i])['C1']
    elements = re.split(regexp, affil)
    elements = [entry for entry in elements if entry]
    for entry in elements[1::2]:
        entry = entry.strip('.; ')
        country = entry.split(', ')[-1]
        print(country)

NY 10065 USA
NY 10065 USA
NY 10065 USA
NY 10065 USA
Japan
Japan


Okay, we are making progress, but the USA format is different from that of the other countries. It contains the state, sometimes a postal code, and the country. We only need the last part and not the rest. A naive solution would be to split on white space, but this would break for country names like South Korea or Peoples Republic of China. Instead, we are going to use a simple regular expression. The pattern is simple: two capital letters, optionally followed by up to 5 digits, followed by three capital letters. The digits don't always occur; they are either there or not. This also implies that the second space is optional.

In [30]:
country_regexp = re.compile(r'([A-Z]{2})\s([\d]{0,5})\s?([A-Z]{3})')

for i in range(4):
    affil = corpus.get(list(corpus.keys())[i])['C1']
    elements = re.split(regexp, affil)
    elements = [entry for entry in elements if entry]
    for entry in elements[1::2]:
        entry = entry.strip('.; ')
        country = entry.split(', ')[-1]
        match = re.match(country_regexp, country)
        if match:
            country = match.group(3)
        
        print(country)

USA
USA
USA
USA
Japan
Japan
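The country pattern can also be checked in isolation on a few hand-picked strings. The helper name `extract_country` is hypothetical, and the test strings are chosen to cover the cases discussed above: a state with postal code, a state without postal code, and multi-word country names that must pass through unchanged.

```python
import re

country_regexp = re.compile(r'([A-Z]{2})\s([\d]{0,5})\s?([A-Z]{3})')


def extract_country(last_part):
    """Return the three-letter country code if the US pattern matches, else the input."""
    match = country_regexp.match(last_part)
    return match.group(3) if match else last_part


for text in ['NY 10065 USA', 'MA USA', 'South Korea', 'Peoples R China', 'Japan']:
    print(repr(text), '->', repr(extract_country(text)))
```

Multi-word names such as 'South Korea' do not match because the pattern requires two consecutive capital letters at the start of the string.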


Things are clearing up nicely. So far, we have tested on the first 4 records (note the 4 as an argument to range). Let's expand the number of records a bit, and avoid showing countries multiple times.

In [31]:
countries = set()

for i in range(50):
    affil = corpus.get(list(corpus.keys())[i])['C1']
    elements = re.split(regexp, affil)
    elements = [entry for entry in elements if entry]
    for entry in elements[1::2]:
        entry = entry.strip('.; ')
        country = entry.split(', ')[-1]
        match = re.match(country_regexp, country)
        if match:
            country = match.group(3)
        
        countries.add(country)
        
for entry in countries:
    print(entry)

South Korea
Russia
France
South Africa
Israel
Australia
Japan
Switzerland
Singapore
Greece
Hungary
Peoples R China
India
Iran
Slovenia
Czech Republic
USA
Sweden
Serbia
Egypt
Italy
Germany
Netherlands


This looks alright. We tested the regular expression on the first 50 records, and no strange results show up. Let's therefore combine the handling of the affiliation with the matching of affiliations to authors. To do this, we introduce a new function, get_country, which accepts the affiliation and returns the country.

In [33]:
country_regexp = re.compile(r'([A-Z]{2})\s([\d]{0,5})\s?([A-Z]{3})')
affil_regexp = re.compile(r'\[(.*?)\]')

def get_country(affiliation):
    entry = affiliation.strip('.; ')
    country = entry.split(', ')[-1]
    match = re.match(country_regexp, country)
    if match:
        country = match.group(3)
    return country

def process_affiliations(record):
    affil = record['C1']
    elements = re.split(affil_regexp, affil)
    elements = [entry for entry in elements if entry]
    
    author_affil = zip(elements[0::2], elements[1::2])

    author_affiliation_map = collections.defaultdict(list)

    for authors, affiliation in author_affil:
        list_of_authors = authors.split(';')
        list_of_authors = [author.strip() for author in list_of_authors]
        list_of_authors = [author.lower() for author in list_of_authors]

        for author in list_of_authors:
            country = get_country(affiliation)
            author_affiliation_map[author].append(country)
    return author_affiliation_map

for i in range(2):
    record = corpus.get(list(corpus.keys())[i])
    
    print(process_affiliations(record))


defaultdict(<class 'list'>, {'ng, brandon d.': ['USA'], 'heller, daniel a.': ['USA', 'USA'], 'shah, janki': ['USA'], 'williams, ryan m.': ['USA'], 'gudas, lorraine j.': ['USA'], 'minton, denise r.': ['USA'], 'park, christopher y.': ['USA', 'USA']})
defaultdict(<class 'list'>, {'saito, mineo': ['Japan'], 'ishii, fumiyuki': ['Japan', 'Japan'], 'sawada, keisuke': ['Japan']})


With these results, we could start doing a simple count, like we have done in the previous examples. However, this would result in counting the same paper multiple times if several authors come from the same country. We need to do something else instead. The easiest step is to first make a matrix of records by countries. This indicates, for each record, how often each country occurs among the affiliations of its authors.

In [34]:
counted_records = {}

for key, record in corpus.items():
    affiliation_map = process_affiliations(record)
    
    counter = collections.Counter()
    for author, countries in affiliation_map.items():
        
        for country in countries:
            counter[country] += 1 
    counted_records[key] = counter
    

In [35]:
import pandas as pd

df = pd.DataFrame.from_dict(counted_records, orient='index')
df = df.fillna(0)
df[0:10]

Unnamed: 0,USA,Japan,Iran,Czech Republic,Egypt,Switzerland,Peoples R China,Germany,Sweden,South Korea,...,Yuchun,Jay,Azerbaijan,J. P,B. S,Niger,S. G,J. B,R. O,Morocco
WOS:000202991501296,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000202995300850,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000202995300852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000202998600001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000203538900008,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000205747600003,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000205747600006,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000205747600009,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000205747600011,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000205747800001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# <-- something went wrong here; it looks like some of the addresses and names got mixed. 

We are now almost there. We have a nice matrix that specifies the frequency of countries for each record. To count a paper only once for a given country, we can use logical indexing: if the frequency is higher than 1, set it to 1. Next, sum each column over all records to produce the frequency of countries.

In [36]:
df[df>1] = 1
country_freqs = df.sum()
country_freqs = country_freqs.sort_values(ascending=False)
country_freqs[0:10]

USA                5556
Peoples R China    2237
Germany            1132
India              1006
England             853
Italy               830
Japan               788
France              718
Spain               610
South Korea         596
dtype: float64

Look carefully at the above code. The sum operation returns a Series, which we sort in descending order to get the most frequently occurring countries.
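The effect of the logical indexing is easiest to see on a toy matrix. The sketch below uses a made-up record by country dataframe with four records, where the record labels and counts are invented for illustration. It applies the same two steps: cap every count at 1, then sum each column.

```python
import pandas as pd

# Made-up record-by-country counts (number of author affiliations per country).
counts = pd.DataFrame(
    {'USA': [5, 0, 3, 1], 'Japan': [0, 2, 1, 0]},
    index=['rec1', 'rec2', 'rec3', 'rec4'],
)

presence = counts.copy()
presence[presence > 1] = 1          # logical indexing: cap every count at 1

# one value per country: the number of records in which it appears
country_freqs = presence.sum().sort_values(ascending=False)
print(country_freqs)
```

Without the capping step, 'USA' would be counted 9 times here instead of 3.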

As you might recall, there were three issues we had to face when determining the number of articles per country. These were 
* determine the country for each author on a given paper
* count each country only once for each paper
* how do we handle papers written by people from different countries? 
We addressed the first issue through careful splitting of the string, first to match each author with the correct affiliation, and second to extract the country from the affiliation. The second issue we addressed through logical indexing on the pandas dataframe. In summing the columns of the dataframe, we implicitly made an assumption with respect to the third item: an article with multiple countries counts once, in full, for each of those countries. This is an easy solution, but be aware that there is quite a literature on fractionating (fractional counting) instead, where a paper's credit is divided among the contributing countries.
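To make the difference between the two attribution schemes concrete, the sketch below compares whole counting, as used above, with one simple form of fractionating, in which every country on a paper receives an equal share of 1/n. The papers are made up, and the equal-share rule is only an assumption for illustration, since other weighting schemes exist.

```python
from collections import Counter

# Made-up papers, each with the set of countries among its authors.
papers = {
    'paper_a': {'USA'},
    'paper_b': {'USA', 'Japan'},
    'paper_c': {'USA', 'Japan', 'France', 'Germany'},
}

whole = Counter()        # every country gets full credit for the paper
fractional = Counter()   # every country gets an equal share of the paper

for countries in papers.values():
    share = 1 / len(countries)
    for country in countries:
        whole[country] += 1
        fractional[country] += share

print(whole)
print(fractional)
```

Under fractional counting the credits add up to the number of papers, whereas under whole counting they add up to more whenever papers are international.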

## citations

In [38]:
cites = corpus.get(list(corpus.keys())[2])['CR']
cites = cites.split('; ')
for entry in cites:
    print(entry.lower())

allenby b, 2001, ieee technology soc, v19, p10
anderson j. r., 1983, architecture cogniti
baird d., 1999, perspectives sci, v7, p231, doi 10.1162/posc.1999.7.2.231
bechtel w., 1991, connectionism mind i
bijker w.e., 1995, bicycles bakelites b
bowker geoffrey c., 1999, sorting things class
chi mth, 1981, cognitive sci, v5, p121, doi 10.1207/s15516709cog0502_2
collins hm, 2002, soc stud sci, v32, p235, doi 10.1177/0306312702032002003
dreyfus h. l., 1986, mind machine power h
epstein s, 1995, sci technol hum val, v20, p408, doi 10.1177/016224399502000402
ericsson ka, 1994, am psychol, v49, p725, doi 10.1037/0003-066x.49.8.725
friedman t., 1999, lexus olive tree
galison peter, 1997, image logic mat cult
goodman peter s., 2003, washington post 0104, pa01
gorman m. e., 2002, j technology transfe, v27, p219, doi 10.1023/a:1015672119590
gorman me, 2002, j eng educ, v91, p339
gorman me, 1997, soc stud sci, v27, p583, doi 10.1177/030631297027004002
gorman m.e., 2000, ethical env challeng
gorman 

# Group by


Often we would like to combine information from different fields on a record. For example, we might want to know the frequency with which countries publish in different journals. That is, we want a matrix of country by journal. This requires combining the information from the journal field with the information from the affiliation field. It is generally convenient to use pandas for this.


In [39]:
counted_records = {}

for key, record in corpus.items():
    affiliation_map = process_affiliations(record)
    
    counter = collections.Counter()
    for author, countries in affiliation_map.items():
        
        for country in countries:
            counter[country] += 1 
    counted_records[key] = counter
    
record_by_country = pd.DataFrame.from_dict(counted_records, orient='index')
record_by_country = record_by_country.fillna(0)
record_by_country[record_by_country>1] = 1

keys = {key:record['J9'] for key, record in corpus.items()}

In [40]:
country_by_journals = record_by_country.groupby(keys).sum().T
print(country_by_journals.iloc[0:10, 0:5])

                      AAPS J  AAPS PHARMSCITECH  AASRI PROC  AATCC REV
USA              182       6                  6           0          0
Japan             18       0                  0           0          0
Iran              19       0                  0           0          0
Czech Republic    35       0                  0           0          0
Egypt              2       0                  0           0          0
Switzerland        7       0                  0           0          0
Peoples R China   53       0                  1           0          2
Germany           28       0                  0           0          0
Sweden             2       0                  0           0          0
South Korea       12       0                  0           0          0
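What the group by with a dictionary does is perhaps clearer on a toy example. In the sketch below, the dataframe and the record-to-journal mapping are made up. Passing the dict to groupby assigns every row to the group found under its index label, and the rows in each group are then summed.

```python
import pandas as pd

# Made-up record-by-country matrix (already capped at 1 per record).
record_by_country = pd.DataFrame(
    {'USA': [1, 1, 0], 'Japan': [0, 1, 1]},
    index=['rec1', 'rec2', 'rec3'],
)

# Made-up mapping from record identifier to journal.
journal_of = {'rec1': 'NANO LETT', 'rec2': 'NANO LETT', 'rec3': 'SMALL'}

# Rows are grouped by the journal of their index label, then transposed
# so that countries become rows and journals become columns.
country_by_journal = record_by_country.groupby(journal_of).sum().T
print(country_by_journal)
```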


In the above group by example, we are grouping on a field that has exactly one value per article, namely the journal. It is also possible to use group by with fields that can have several values per article. However, this is a bit more complicated. Imagine we want to profile countries by the subject categories of articles published by authors from those countries. The problem we now face is that an article can have more than one subject category, which creates a problem with group by. The solution is to make two matrices: article by country, and article by subject category. Next, we merge these two. After the merge, we can use a group by. Let's go through this step by step.

In [41]:
sc_recs = []

for key, rec in corpus.items():
    subject_categories = rec['SC'].split('; ')
    for sc in subject_categories:
        sc_recs.append({'UT':key, 'SC':sc})
df_sc = pd.DataFrame(sc_recs)
print(df_sc[0:5])



                                    SC                   UT
0                            Chemistry  WOS:000352750200022
1  Science & Technology - Other Topics  WOS:000352750200022
2                    Materials Science  WOS:000352750200022
3                              Physics  WOS:000352750200022
4                              Physics  WOS:000257272600028


We have now created two dataframes. The first, record_by_country, has one row per record and one column per country. The second, df_sc, has one row for every combination of record and subject category.

We can now combine these two. This is known as a merge in pandas. In the merge function, we specify a left-hand and a right-hand dataframe that we want to merge. We also need to specify what we want to merge on. In this particular case, we would like to merge on the unique article identifier. This results in a new dataframe that combines the information from the left and right dataframes. Because we are joining on the UT tag and the subject category dataframe can contain the same UT tag more than once, we end up with a new dataframe in which a given article can also occur more than once.

In order to merge on the UT tag, we first modify the country dataframe we produced earlier. We move the current index, which is the UT tag, to a separate column and make sure this new column has the correct label.

In [42]:
df_country = record_by_country.reset_index(level=0)
df_country = df_country.rename(columns = {'index':'UT'})
df_country[0:5]

Unnamed: 0,UT,USA,Japan,Iran,Czech Republic,Egypt,Switzerland,Peoples R China,Germany,Sweden,...,Yuchun,Jay,Azerbaijan,J. P,B. S,Niger,S. G,J. B,R. O,Morocco
0,WOS:000202991501296,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,WOS:000202995300850,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,WOS:000202995300852,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,WOS:000202998600001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,WOS:000203538900008,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We now have two dataframes, both of which have a UT column. So let's merge the two dataframes on the UT column.

In [43]:
df = pd.merge(df_country, df_sc, on=['UT'])
print(df.iloc[0:11, 0:5])

                     UT  USA  Japan  Iran  Czech Republic
0   WOS:000202991501296    1      0     0               0
1   WOS:000202995300850    0      0     0               0
2   WOS:000202995300852    0      0     0               0
3   WOS:000202998600001    0      0     0               0
4   WOS:000202998600001    0      0     0               0
5   WOS:000202998600001    0      0     0               0
6   WOS:000203538900008    0      0     0               0
7   WOS:000203538900008    0      0     0               0
8   WOS:000205747600003    1      0     0               0
9   WOS:000205747600003    1      0     0               0
10  WOS:000205747600006    0      0     0               0


Remember, the aim was to produce a country by subject category matrix. We are almost there. We can now use groupby on the subject category column to produce the desired grouping. Before we do so, it is useful to first drop the UT column because we no longer need it. For this, we can use the drop method on the dataframe, specifying the name of the column we would like to drop and the axis along which we want to drop. Since we are not going to need the dataframe with the UT column any more, we do the operation in place. That is, we modify the dataframe itself, rather than copying all the data to a new dataframe.

In [44]:
df.drop('UT', axis=1, inplace=True)

As a final step we now do the group by on subject category and sum up.

In [45]:
df_country_sc = df.groupby('SC').sum()
print(df_country_sc.T.iloc[0:10, 0:5])

SC                  Acoustics  Agriculture  Allergy  Anatomy & Morphology
USA              0          6           25        0                     0
Japan            0          0            0        1                     0
Iran             0          0            4        0                     0
Czech Republic   1          0            0        0                     1
Egypt            0          0            0        0                     0
Switzerland      1          0            1        0                     0
Peoples R China  0          1           11        0                     2
Germany          1          1            7        0                     1
Sweden           0          0            0        0                     0
South Korea      0          0            3        0                     0
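The whole merge-then-group sequence can be replayed on a toy example, which makes it easier to follow what happens to articles with several subject categories. Both dataframes below are made up. They only mimic the layout of df_country and df_sc.

```python
import pandas as pd

# Made-up article-by-country table with the identifier as a regular column.
df_country = pd.DataFrame({'UT': ['rec1', 'rec2'], 'USA': [1, 0], 'Japan': [0, 1]})

# Made-up article-by-subject-category table; rec1 has two categories.
df_sc = pd.DataFrame({
    'UT': ['rec1', 'rec1', 'rec2'],
    'SC': ['Chemistry', 'Physics', 'Physics'],
})

# The merge repeats the country row of rec1 once per subject category.
merged = pd.merge(df_country, df_sc, on='UT')

# Drop the identifier, sum the country columns per category, then transpose.
country_by_sc = merged.drop('UT', axis=1).groupby('SC').sum().T
print(country_by_sc)
```

An article with two subject categories contributes to both columns, which is the same whole-counting assumption made earlier for countries.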


# cleaning the data

It is almost always necessary to perform some data cleaning at some stage of a text mining project. For example, the same person might be present in the dataset under different names; the same organization might be present under slightly different names; or some words might not be relevant for the analysis. Data cleaning can be done at different stages of the analysis. For example, in the previous chapter, we did not index a set of stopwords. This represents a form of early cleaning. Alternatively, we can use the drop function to drop columns from the dataframe at a later stage of the analysis.

As an example, let's start with making a record by author dataframe.

In [46]:
# let's count the authors
# because we don't want to index all authors
# also keep a global counter
authors_counters = {}
global_counter = collections.Counter()
for key, record in corpus.items():
    authors = record['AU']
    list_of_authors = authors.split('; ')
    list_of_authors = [author.strip() for author in list_of_authors]
    list_of_authors = [author.lower() for author in list_of_authors]    
    
    counter = collections.Counter()
    
    for author in list_of_authors:
        counter[author] += 1
        global_counter[author] += 1
    authors_counters[key] = counter

# get the top 250 most frequently occurring authors
top_250 = global_counter.most_common(n=250)
top_250 = collections.Counter(dict(top_250))

# for all records only keep authors in the top 250
indexed_authors = {}
for key, value in authors_counters.items():
    # the & operator intersects two Counters: only authors present
    # in both are kept, each with the smaller of the two counts
    intersect = value & top_250
    indexed_authors[key] = intersect

# turn it into a dataframe
rec_authors = pd.DataFrame.from_dict(indexed_authors, orient='index')
rec_authors.fillna(0, inplace=True)

print(rec_authors.sum()[0:20])

baglioni, p      20
kumar, r         30
wood, j          32
kim, hj          17
yoshida, y       16
wang, zl         38
kim, h           23
kim, k           29
jiang, l         25
chen, y          45
park, j          23
liu, h           24
gordon, r        16
ostrikov, k      42
farokhzad, oc    22
langer, r        38
lee, s           28
park, jh         16
park, sh         24
ferrari, m       55
dtype: float64


So, as an example, let's say that we discover that the authors labelled 'yang, l' and 'yang, h' are actually the same person. In that case, we would like to combine the two into a single column. This can again be accomplished with a simple group by operation. However, rather than producing the grouping keys up front as we did with the country by journal example, or including them as a column in the dataframe as we did with the country by subject category example, here we use a function. This function receives the index for a given row, or the column label, and should return the group to which that index or column belongs.

This function can be written as a simple lookup in a thesaurus. The thesaurus is a dict with the variant author names as keys and the correct name as value. The function tries to return a value from the thesaurus. If this fails with a KeyError, the name is not in the thesaurus and should be treated as its own group instead.

In [47]:
thesaurus = {'yang, l': 'yang, h'}

def thesaurus_lookup(author):
    try:
        return thesaurus[author]
    except KeyError:
        return author

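# group the author columns via the transpose; this avoids the axis=1
# argument, which recent pandas versions deprecate for groupby
rec_authors = rec_authors.T.groupby(thesaurus_lookup).sum().T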
print(rec_authors.sum()['yang, h'])

45.0


The best point for cleaning the data depends on the exact use case. Removing stop words is a typical step that is easily done early. In contrast, the need to merge columns or rows is often something you discover somewhere during the data analysis. In those cases, it is easier to perform the cleaning later, once it turns out to be needed. A second argument for cleaning later rather than earlier is that you keep your data in its original form as long as possible.
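For comparison, the early variant of the same cleaning step would apply the thesaurus while counting, before any dataframe is built. The sketch below is a minimal illustration on made-up author strings and is not part of the original analysis. It uses dict.get with the name itself as the fallback, which is equivalent to the try/except lookup above.

```python
from collections import Counter

thesaurus = {'yang, l': 'yang, h'}

# Made-up 'AU' fields from three records.
author_fields = ['Yang, L; Smith, A', 'Yang, H', 'Smith, A; Yang, L']

counter = Counter()
for field in author_fields:
    for author in field.split(';'):
        name = author.strip().lower()
        # map known variants to the preferred name, keep all others as they are
        counter[thesaurus.get(name, name)] += 1

print(counter.most_common())
```

The drawback, as noted above, is that the original name variants are no longer visible in the counts.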