# Check whether the original JB wiki clusters (downloaded from [here](http://panchenko.me/data/joint/dt-wiki-deps-jst-wpf1k-fpw1k-thr.csv-cw-e0-N200-n200-minsize15.csv.gz)) are parsed correctly by the reader implemented in TWSI_evaluation. For example, the word column should contain words only, and no mixed-type errors should occur.

In [1]:
from pandas import read_csv

user_inventory_fpath = "/home/pelevina/experiment/intermediate/wiki-clusters-dep-cw-e0-N200-n200-minsize15-count930408.csv"

user_inventory = read_csv(user_inventory_fpath, sep="\t", encoding='utf8', header=None, names=["word","sense_id","cluster"])

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
user_inventory.iloc[10].word


u'0'

In [4]:
user_inventory.iloc[0]

word            cid
sense_id    cluster
cluster        isas
Name: word, dtype: object

### The file has a false header of 4 elements: word, cid, cluster, isas.
### This caused two problems: 1. The first and second rows were parsed incorrectly. 2. The word column was interpreted as the index column.
### Change the header in the file to 'word, cid, cluster' and repeat the parsing.
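A mismatch like this can be caught before parsing by comparing the number of fields in the first line with the number of expected column names. The sketch below is only an illustration and not part of TWSI_evaluation; the helper name `first_line_field_count` and the sample text are hypothetical.

```python
import io

EXPECTED_COLUMNS = ["word", "sense_id", "cluster"]


def first_line_field_count(handle, sep="\t"):
    """Return the number of sep-separated fields in the first line of a text stream."""
    first_line = handle.readline().rstrip("\r\n")
    return len(first_line.split(sep))


# Hypothetical sample mimicking the problem: a 4-field header over 3-field rows.
sample = io.StringIO(
    "word\tcid\tcluster\tisas\n"
    "Otto\t0\tMaximilian:0.5,elector:0.4\n"
)

n_fields = first_line_field_count(sample)
header_matches = n_fields == len(EXPECTED_COLUMNS)
print(n_fields, header_matches)
```

For a real file, the same helper could be called on an open file handle instead of the in-memory sample.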

In [5]:
user_inventory = read_csv(user_inventory_fpath, sep="\t", encoding='utf8', header=None, names=["word","sense_id","cluster"])

  interactivity=interactivity, compiler=compiler, result=result)


In [6]:
user_inventory.iloc[10].word

u'Maximilian'

### Problem solved, now check the word column

In [7]:
user_inventory.head

<bound method DataFrame.head of                                                      word sense_id  \
0                                                    word      cid   
1                                                     (ep        2   
2                                                     (ep        3   
3                                      minister-president        1   
4                                      minister-president        3   
5                                      minister-president        5   
6                                      Minister-President        0   
7                                      Minister-President        1   
8                                      Minister-President        2   
9                                                    Otto        0   
10                                             Maximilian        0   
11                                             Maximilian        1   
12                                                elector 

In [9]:
user_inventory[user_inventory.isnull().any(axis=1)].shape

(14508, 3)

### Clearly, some lines are parsed incorrectly (e.g. 915852). This is most probably because double quotes appear in some words, while the double quote is the default quote character (`quotechar`) for read_csv. As a result, 14508 lines contain NaN values. Change the reading parameters and repeat.

In [10]:
user_inventory = read_csv(user_inventory_fpath, sep="\t", encoding='utf8', header=None, names=["word","sense_id","cluster"], doublequote=False, quotechar=u"\u0000")
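The quoting problem can be reproduced on a toy sample. The sketch below is a hedged illustration with made-up data: it uses `quoting=csv.QUOTE_NONE`, which is an alternative to the `quotechar` setting in the cell above and is not taken from the TWSI_evaluation code. It assumes the default C parser engine.

```python
import csv
import io

import pandas as pd

# Hypothetical 4-line sample; lines 2 and 3 begin with a literal double quote.
SAMPLE = (
    'a\t0\tx:0.1\n'
    '"b\t1\ty:0.2\n'
    '"c\t2\tz:0.3\n'
    'd\t3\tw:0.4\n'
)
NAMES = ["word", "sense_id", "cluster"]

# Default settings: '"' at the start of a field opens a quoted field that runs
# on to the next quote, so lines 2 and 3 are merged into a single record.
default_df = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None, names=NAMES)

# QUOTE_NONE switches quote handling off, so every line stays a separate row
# and the quote characters are kept as part of the word.
raw_df = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None, names=NAMES,
                     quoting=csv.QUOTE_NONE)

print(len(default_df), len(raw_df))
print(raw_df["word"].tolist())
```

Comparing the row count of the parsed frame with the known number of lines in the file is a cheap check that no records were merged.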

In [11]:
user_inventory.head


<bound method DataFrame.head of                          word sense_id  \
0                        word      cid   
1                         (ep        2   
2                         (ep        3   
3          minister-president        1   
4          minister-president        3   
5          minister-president        5   
6          Minister-President        0   
7          Minister-President        1   
8          Minister-President        2   
9                        Otto        0   
10                 Maximilian        0   
11                 Maximilian        1   
12                    elector        0   
13                    elector        1   
14                    elector        2   
15                    elector        3   
16                     Amalie        0   
17                   Palatine        0   
18                   Palatine        1   
19                   Palatine        2   
20                    Duchess        0   
21                   Dorothea        0   
22

### Now all lines appear to be parsed correctly. The row count is correct. Words with double quotes appear in the cluster column. Check for NaN:

In [13]:
user_inventory[user_inventory.isnull().any(axis=1)]

| | word | sense_id | cluster |
|---|---|---|---|
| 108795 | NaN | 1 | SH:0.017,NH:0.017,CR:0.015,VT:0.015,HB:0.015,N... |
| 108796 | NaN | 2 | JA:0.034,JM:0.033,RA:0.029,DA:0.029,JL:0.028,A... |
| 181715 | NaN | 10 | Mazda:0.004,n/a:0.004,turbo:0.004,NA:0.004,tur... |
| 181716 | NaN | 23 | KS:0.003,STV:0.003,ICAO:0.003,EIS:0.003,HEX:0.... |
| 300525 | NaN | 3 | Nan:0.012,Tian:0.005,Cong:0.004,bao:0.004,zhi:... |
| 300526 | NaN | 9 | Eilean:0.028,Coire:0.015,Mòr:0.014,Dearg:0.013... |
| 300527 | NaN | 10 | pancake:0.005,flatbread:0.004,ga:0.004,bread:0... |
| 508753 | NaN | 3 | dislocation:0.003,disturbance:0.002,gradient:0... |
| 508754 | NaN | 14 | −1:0.004,"y":0.003,"i":0.002,1.0):0.002,>:0.00... |
| 549378 | NaN | 1 | GB:0.003,byte:0.003,matching:0.003,sorting:0.0... |


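### Manual error analysis: in all these lines the word itself is spelled 'nan'. read_csv treats this string as a missing-value marker by default, so these rows are not parsing errors.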
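If words such as 'nan' should survive as ordinary strings, pandas' default missing-value markers can be disabled. The sketch below shows one possible way with `keep_default_na=False` on a made-up sample; whether TWSI_evaluation should adopt this setting is an assumption, not something established in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample in which the second word is literally the string "nan".
SAMPLE = "Otto\t0\telector:0.5\nnan\t1\tSH:0.017,NH:0.017\n"
NAMES = ["word", "sense_id", "cluster"]

# Default behaviour: "nan" is one of pandas' default missing-value markers.
default_df = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None, names=NAMES)

# keep_default_na=False (with no na_values given) keeps such strings as text.
literal_df = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None, names=NAMES,
                         keep_default_na=False)

n_missing_default = int(default_df["word"].isnull().sum())
n_missing_literal = int(literal_df["word"].isnull().sum())
print(n_missing_default, n_missing_literal, literal_df["word"].tolist())
```

Note that this setting applies to every column, so genuinely empty fields would then be read as empty strings rather than NaN.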

# Conclusion

### Change the reading parameters in TWSI for both the *user inventory* and the *TWSI dataset*. The latter is relevant because cluster words with double quotes could get into the predicted_related_terms column through neighbours of sense vectors built on the wiki JB clusters.