
# **Load and prepare data**

## 1. Load the dataset into a pandas dataframe.

In [0]:
import nltk
import pandas as pd
import re

In [2]:
!wget https://alexip-ml.s3.amazonaws.com/stackexchange_812k.csv.gz

--2020-03-02 12:17:31--  https://alexip-ml.s3.amazonaws.com/stackexchange_812k.csv.gz
Resolving alexip-ml.s3.amazonaws.com (alexip-ml.s3.amazonaws.com)... 52.216.98.187
Connecting to alexip-ml.s3.amazonaws.com (alexip-ml.s3.amazonaws.com)|52.216.98.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130230720 (124M) [application/x-gzip]
Saving to: ‘stackexchange_812k.csv.gz’


2020-03-02 12:17:36 (31.8 MB/s) - ‘stackexchange_812k.csv.gz’ saved [130230720/130230720]



In [0]:
!gunzip stackexchange_812k.csv.gz

In [0]:
df = pd.read_csv("stackexchange_812k.csv")
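As an aside, `pd.read_csv` infers gzip compression from a `.gz` extension, so the separate `gunzip` step is optional. The sketch below shows this on a tiny throwaway file; the file name `toy.csv.gz` and its contents are made up for the illustration and are not part of the dataset.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV to a temporary directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv.gz')
    with gzip.open(path, 'wt', encoding='utf-8') as handle:
        handle.write('post_id,text\n1,hello\n2,world\n')

    # compression='infer' is the default, so the .gz suffix is enough.
    toy = pd.read_csv(path)

print(toy.shape)
```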

In [5]:
df.shape

(812132, 5)

In [6]:
df.columns

Index(['post_id', 'parent_id', 'comment_id', 'text', 'category'], dtype='object')

In [7]:
df.dtypes

post_id         int64
parent_id     float64
comment_id    float64
text           object
category       object
dtype: object

## 2. Use regular expressions to remove elements that are not words
 such as:
 - html tags, 
 - latex expressions, 
 - urls, 
 - digits, 
 - line returns, …


In [0]:
def remove_html_tags(text):
    """Remove html tags from a string"""  
    clean = re.compile('<.*?>') 
    return re.sub(clean, '', text)

In [0]:
def remove_latex(text):
    """Remove inline LaTeX expressions ($...$) from a string"""
    clean = re.compile(r'\$.*?\$')
    return re.sub(clean, '', text)


In [0]:
def remove_url(text):
    """Remove urls from a string"""
    text = re.sub(r'(http\S+|ftp\S+|www\S+)', '', text)
    return text

In [0]:
def remove_specialchar(text):
    """Remove every character that is not a letter, a space or one of , . ! ?"""
    text = re.sub('[^A-Za-z,.!? ]+', '', text)  # alternative keeping digits and colons: '[^A-Za-z0-9 .!,:?]+'
    return text

Apply the cleaning functions to the `text` column:

In [0]:
# remove html
df['text'] = df['text'].apply(remove_html_tags)

In [0]:
# remove latex
df['text'] = df['text'].apply(remove_latex)

In [0]:
# remove url
df['text'] = df['text'].apply(remove_url)

In [0]:
# remove special characters
df['text'] = df['text'].apply(remove_specialchar)
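The four passes above substitute an empty string, so the words on either side of a removed tag or line break can end up fused together. One possible variant, sketched below, substitutes a space instead and collapses repeated whitespace at the end. The function name `clean_text`, the precompiled pattern names, and the sample string are assumptions made for this illustration; the patterns themselves are the ones used in the notebook.

```python
import re

# Same patterns as the notebook, precompiled once (names are hypothetical).
HTML_TAG = re.compile(r'<.*?>')
LATEX_INLINE = re.compile(r'\$.*?\$')
URL = re.compile(r'(http\S+|ftp\S+|www\S+)')
NON_WORD = re.compile(r'[^A-Za-z,.!? ]+')


def clean_text(text):
    """Run the four cleaning passes in notebook order, replacing matches with a space."""
    for pattern in (HTML_TAG, LATEX_INLINE, URL, NON_WORD):
        text = pattern.sub(' ', text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r'\s+', ' ', text).strip()


sample = '<p>First line.</p>\n<p>See $x^2$ at http://example.com now, 3 times!</p>'
print(clean_text(sample))
```

Whether a space or an empty string is the better replacement depends on the downstream model; this is only one option.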

## 3. Remove missing values for texts

In [0]:
import numpy as np

In [0]:
na = df['text'].isna()

In [19]:
sum(na == False) == len(df['text'])

True
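The check returns `True`, so the column holds no `NaN` at this point and nothing has to be dropped. If missing values were present, they could be removed with `dropna`. Strings that became empty after cleaning are not `NaN`, so they need their own filter. A minimal sketch on a made-up frame (`toy` and its rows are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value and one empty string.
toy = pd.DataFrame({'text': ['some text', np.nan, '', 'more text']})

# Drop rows whose text is NaN.
toy = toy.dropna(subset=['text'])

# Drop rows whose text is empty or whitespace only.
toy = toy[toy['text'].str.strip() != '']

print(toy)
```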

## 4. Remove texts that are extremely large or too short
 to bring any information to the model. We want to keep paragraphs that contain at least a few words and remove the paragraphs that are composed of large numerical tables.

In [0]:
text_length = df['text'].apply(lambda x: len(x))

In [23]:
min(text_length), max(text_length)

(0, 21933)

Short text (number of characters < 30):

In [0]:
short_text = df[text_length < 30]

In [25]:
short_text.shape

(27934, 5)

Show some examples:

In [26]:
df['category'].unique()

array(['title', 'post', 'comment'], dtype=object)

In [27]:
short_text[short_text.category == 'post'][0:10]

Unnamed: 0,post_id,parent_id,comment_id,text,category
91896,259,258.0,,Support vector machine,post
91927,326,114.0,,XIANS OG,post
91985,435,423.0,,One more Dilbert cartoon...,post
92036,546,423.0,,Another one from xkcd,post
92105,656,423.0,,,post
92746,1890,1883.0,,The creation of this site,post
92761,1920,1904.0,,ACM SIGKDD KDD in San Diego,post
92970,2293,1906.0,,NIPS,post
93046,2431,2423.0,,All of Statistics,post
93210,2729,2728.0,,What about silhouette?,post


In [28]:
short_text[short_text.category == 'comment'][0:10]

Unnamed: 0,post_id,parent_id,comment_id,text,category
259062,4,,15949.0,A related question,comment
259076,7,,800725.0,,comment
259082,9,,2723.0,James,comment
259099,18,,7.0,also the US census data,comment
259110,26,,127496.0,Here is a good explanation,comment
259121,30,,245.0,Similar question on SO,comment
259133,36,,18.0,,comment
259149,43,,206.0,"fine, you win",comment
259151,44,,362884.0,is this your homework?,comment
259205,75,,85.0,,comment


In [29]:
short_text[short_text.category == 'title'][0:10]

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
11,23,,,Finding the PDF given the CDF,title
13,26,,,What is a standard deviation?,title
19,44,,,Explain data visualization,title
37,138,,,Free resources for learning R,title
38,145,,,Free Dataset Resources?,title
42,170,,,Free statistical textbooks,title
57,249,,,Variance components,title
63,290,,,Resources for learning Stata,title


Remove short text

In [0]:
df = df[~df.index.isin(short_text.index)]

In [31]:
df.shape

(784198, 5)

In [32]:
new_length = df['text'].apply(lambda x: len(x))
min(new_length), max(new_length)

(30, 21933)

Long text (number of characters > 10000):

In [0]:
# use the lengths recomputed on the filtered dataframe so the mask index matches df
long_text = df[new_length > 10000]

In [34]:
long_text.shape

(143, 5)

Show some examples

In [0]:
pd.options.display.max_colwidth = max(text_length)

In [36]:
long_text[long_text.category == 'post'][0:2]

Unnamed: 0,post_id,parent_id,comment_id,text,category
92968,2287,2272.0,,"I agree completely with Srikants explanation. To give a more heuristic spin on itClassical approaches generally posit that the world is one way e.g., a parameter has one particular true value, and try to conduct experiments whose resulting conclusion no matter the true value of the parameter will be correct with at least some minimum probability.As a result, to express uncertainty in our knowledge after an experiment, the frequentist approach uses a confidence interval a range of values designed to include the true value of the parameter with some minimum probability, say . A frequentist will design the experiment and confidence interval procedure so that out of every experiments run start to finish, at least of the resulting confidence intervals will be expected to include the true value of the parameter. The other might be slightly wrong, or they might be complete nonsense formally speaking thats ok as far as the approach is concerned, as long as out of inferences are correct. Of course we would prefer them to be slightly wrong, not total nonsense.Bayesian approaches formulate the problem differently. Instead of saying the parameter simply has one unknown true value, a Bayesian method says the parameters value is fixed but has been chosen from some probability distribution known as the prior probability distribution. Another way to say that is that before taking any measurements, the Bayesian assigns a probability distribution, which they call a belief state, on what the true value of the parameter happens to be. This prior might be known imagine trying to estimate the size of a truck, if we know the overall distribution of truck sizes from the DMV or it might be an assumption drawn out of thin air. The Bayesian inference is simpler we collect some data, and then calculate the probability of different values of the parameter GIVEN the data. This new probability distribution is called the a posteriori probability or simply the posterior. 
Bayesian approaches can summarize their uncertainty by giving a range of values on the posterior probability distribution that includes of the probability this is called a credibility interval.A Bayesian partisan might criticize the frequentist confidence interval like this So what if out of experiments yield a confidence interval that includes the true value? I dont care about experiments I DIDNT DO I care about this experiment I DID DO. Your rule allows out of the to be complete nonsense negative values, impossible values as long as the other are correct thats ridiculous.A frequentist diehard might criticize the Bayesian credibility interval like this So what if of the posterior probability is included in this range? What if the true value is, say, .? If it is, then your method, run start to finish, will be WRONG of the time. Your response is, Oh well, thats ok because according to the prior its very rare that the value is ., and that may be so, but I want a method that works for ANY possible value of the parameter. I dont care about values of the parameter that IT DOESNT HAVE I care about the one true value IT DOES HAVE. Oh also, by the way, your answers are only correct if the prior is correct. If you just pull it out of thin air because it feels right, you can be way off.In a sense both of these partisans are correct in their criticisms of each others methods, but I would urge you to think mathematically about the distinction as Srikant explains.Heres an extended example from that talk that shows the difference precisely in a discrete example.When I was a child my mother used to occasionally surprise me by ordering a jar of chocolatechip cookies to be delivered by mail. The delivery company stocked four different kinds of cookie jars type A, type B, type C, and type D, and they were all on the same truck and you were never sure what type you would get. 
Each jar had exactly cookies, but the feature that distinguished the different cookie jars was their respective distributions of chocolate chips per cookie. If you reached into a jar and took out a single cookie uniformly at random, these are the probability distributions you would get on the number of chipsA typeA cookie jar, for example, has cookies with two chips each, and no cookies with four chips or more! A typeD cookie jar has cookies with one chip each. Notice how each vertical column is a probability mass function the conditional probability of the number of chips youd get, given that the jar A, or B, or C, or D, and each column sums to .I used to love to play a game as soon as the deliveryman dropped off my new cookie jar. Id pull one single cookie at random from the jar, count the chips on the cookie, and try to express my uncertainty at the level of which jars it could be. Thus its the identity of the jar A, B, C or D that is the value of the parameter being estimated. The number of chips , , , or is the outcome or the observation or the sample.Originally I played this game using a frequentist, confidence interval. Such an interval needs to make sure that no matter the true value of the parameter, meaning no matter which cookie jar I got, the interval would cover that true value with at least probability.An interval, of course, is a function that relates an outcome a row to a set of values of the parameter a set of columns. But to construct the confidence interval and guarantee coverage, we need to work vertically looking at each column in turn, and making sure that of the probability mass function is covered so that of the time, that columns identity will be part of the interval that results. Remember that its the vertical columns that form a p.m.f.So after doing that procedure, I ended up with these intervalsFor example, if the number of chips on the cookie I draw is , my confidence interval will be B,C,D. 
If the number is , my confidence interval will be B,C. Notice that since each column sums to or greater, then no matter which column we are truly in no matter which jar the deliveryman dropped off, the interval resulting from this procedure will include the correct jar with at least probability.Notice also that the procedure I followed in constructing the intervals had some discretion. In the column for typeB, I could have just as easily made sure that the intervals that included B would be ,,, instead of ,,,. That would have resulted in coverage for typeB jars , still meeting the lower bound of .My sister Bayesia thought this approach was crazy, though. You have to consider the deliverman as part of the system, she said. Lets treat the identity of the jar as a random variable itself, and lets assume that the deliverman chooses among them uniformly meaning he has all four on his truck, and when he gets to our house he picks one at random, each with uniform probability.With that assumption, now lets look at the joint probabilities of the whole event the jar type and the number of chips you draw from your first cookie, she said, drawing the following tableNotice that the whole table is now a probability mass function meaning the whole table sums to .Ok, I said, where are you headed with this?Youve been looking at the conditional probability of the number of chips, given the jar, said Bayesia. Thats all wrong! What you really care about is the conditional probability of which jar it is, given the number of chips on the cookie! Your interval should simply include the list jars that, in total, have probability of being the true jar. Isnt that a lot simpler and more intuitive?Sure, but how do we calculate that? I asked.Lets say we know that you got chips. Then we can ignore all the other rows in the table, and simply treat that row as a probability mass function. Well need to scale up the probabilities proportionately so each row sums to , though. 
She didNotice how each row is now a p.m.f., and sums to . Weve flipped the conditional probability from what you started with now its the probability of the man having dropped off a certain jar, given the number of chips on the first cookie.Interesting, I said. So now we just circle enough jars in each row to get up to probability? We did just that, making these credibility intervalsEach interval includes a set of jars that, a posteriori, sum to probability of being the true jar.Well, hang on, I said. Im not convinced. Lets put the two kinds of intervals sidebyside and compare them for coverage and, assuming that the deliveryman picks each kind of jar with equal probability, credibility.Here they areConfidence intervalsCredibility intervalsSee how crazy your confidence intervals are? said Bayesia. You dont even have a sensible answer when you draw a cookie with zero chips! You just say its the empty interval. But thats obviously wrong it has to be one of the four types of jars. How can you live with yourself, stating an interval at the end of the day when you know the interval is wrong? And ditto when you pull a cookie with chips your interval is only correct of the time. Calling this a confidence interval is bullshit.Well, hey, I replied. Its correct of the time, no matter which jar the deliveryman dropped off. Thats a lot more than you can say about your credibility intervals. What if the jar is type B? Then your interval will be wrong of the time, and only correct of the time!This seems like a big problem, I continued, because your mistakes will be correlated with the type of jar. If you send out Bayesian robots to assess what type of jar you have, each robot sampling one cookie, youre telling me that on typeB days, you will expect of the robots to get the wrong answer, each having belief in its incorrect conclusion! 
Thats troublesome, especially if you want most of the robots to agree on the right answer.PLUS we had to make this assumption that the deliveryman behaves uniformly and selects each type of jar at random, I said. Where did that come from? What if its wrong? You havent talked to him you havent interviewed him. Yet all your statements of a posteriori probability rest on this statement about his behavior. I didnt have to make any such assumptions, and my interval meets its criterion even in the worst case.Its true that my credibility interval does perform poorly on typeB jars, Bayesia said. But so what? Type B jars happen only of the time. Its balanced out by my good coverage of type A, C, and D jars. And I never publish nonsense.Its true that my confidence interval does perform poorly when Ive drawn a cookie with zero chips, I said. But so what? Chipless cookies happen, at most, of the time in the worst case a typeD jar. I can afford to give nonsense for this outcome because NO jar will result in a wrong answer more than of the time.The column sums matter, I said.The row sums matter, Bayesia said.I can see were at an impasse, I said. Were both correct in the mathematical statements were making, but we disagree about the appropriate way to quantify uncertainty.Thats true, said my sister. Want a cookie?",post
93715,3673,3419.0,,"You asked a difficult question, but Im a little bit surprised that the various clues that were suggested to you received so little attention. I upvoted all of them because I think they basically are useful responses, though in their actual form they call for further bibliographic work.Disclaimer I never had to deal with such a problem, but I regularly have to expose statistical results that may differ from physicians a priori beliefs and I learn a lot from unraveling their lines of reasoning. Also, I have some background in teaching human decisionknowledge from the perspective of Artificial Intelligence and Cognitive Science, and I think what you asked is not so far from how experts actually decide that two objects are similar or not, based on their attributes and a common understanding of their relationships.From your question, I noticed two interesting assertions. The first one related to how an expert assess the similarity or difference between two set of measurements I dont particularly care if there is some relation between attribute X and Y. What I care about is if a doctor thinks there is a relation between X and Y.The second one, How can I predict what they think the similarity is? Do they look at certain attributes?looks like it is somewhat subsumed by the former, but it seems more closely related to what are the most salient attributes that allow to draw a clear separation between the objects of interest.To the first question, I would answer Well, if there is no characteristic or objective relationship between any two subjects, what would be the rationale for making up an hypothetical one? Rather, I think the question should be If I only have limited resources knowledge, time, data to take a decision, how do I optimize my choice? 
To the second question, my answer is Although it seems to partly contradicts your former assertion if there is no relationship at all, it implies that the available attributes are not discriminative or useless, I think that most of the time this is a combination of attributes that makes sense, and not only how a given individual scores on a single attribute.Let me dwell on these two points.Human beings have a limited or bounded rationality, and can take a decision often the right one without examining all possible solutions. There is also a close connection with abductive reasoning.It is well known that there is some variability between individual judgments, and even between judgments from the same expert at two occasions. This is what we are interested in in reliability studies. But you want to know how these experts elaborate their judgments. There is a huge amount of papers about that in cognitive psychology, especially on the fact that relative judgments are easier and more reliable than absolute ones. Doctors decisions are interesting in this respect because they are able to take a good decision with a limited amount of information, but at the same time they benefit from an ever growing internal knowledge base from which they can draw expected relationships extrapolation. In other words, they have a builtin inference assumed to be hypotheticodeductive machinery and accumulate positive evidence or counterfactuals from there experience or practice. Reproducing this inferential ability and the use of declarative knowledge was the aim of several expert or production rule systems in the s, the most famous one being MYCIN, and more generally of Artifical Intelligence earlier in Can we reproduce on an artificial system the intelligent behavior observed in man?. 
Automatic treatment of speech, problem solving, visual shape recognition are still active projects nowadays and they all have to do with identifying salient features and their relationships to make an appropriate decision i.e., how far should two patterns be to be judged as the emanation of two distinct generating processes?.In sum, our doctors are able to draw an optimal inference from a limited amount of data, compensating from noise that arises simply as a byproduct of individual variability at the level of the patients. Thus, there is a clear connection with statistics and probability theory, and the question is what conscious or subconscious methodology help doctors forming their judgments. Semantic networks SN, belief networks, and decision trees are all relevant to the question you asked. The paper you cited is about using an ontology as a basis of formal judgments, but it is no more than an extension of SNs, and many projects were initiated in this direction I can think of the Gene Ontology for genomic studies, but many others exist in different domains. Now, look at the following hierarchical classification of diagnostic categories it is roughly taken from Dunn , p. And now take a look at the ICD classification I think it is not too far from this schematic classification. Mental disorders are organized into distinct categories, some of them being closer one to each other. What render them similar is the closeness of their expression phenotype in any patient, and the fact that they share some similarities in their somaticpsychological etiology. Assessing whether two doctors would make the same diagnostic is a typical example of an interrater agreement study, where two psychiatrists are asked to place each of several patients in mutually exclusive categories. 
The hierarchical structure should be reflected in the disagreement between each doctor, that is they may not agree on the finer distinction between diagnostic classes the leafs but if they were to disagree between insomnia and schizophrenia, well it would be little bit disconcerting... How these two doctors decide on which class a given patient belongs to is no more than a clustering problem How likely are two individuals, given a set of observed values on different attributes, to be similar enough so that I decide they share the same class membership? Now, some attributes are more influential than others, and this is exactly what is reflected in the weight attributed to a given attribute in Latent Class Analysis which can be thought of as a probabilistic extension of clustering methods like kmeans, or the variable importance in Random Forests. We need to put things into boxes, because at first sight its simpler. The problem is that often things overlap to some extent, so we need to consider different levels of categorization.In fact, cluster analysis is at the heart of the actual DSM categories, and many papers actually turn around assigning one patient to a specific syndromic category, based on the profile of his response to a battery of neuropsychological assessments. This merely looks like a subtyping approach each time, we seek to refine a preliminary wellestablished diagnostic category, by adding exception rules or an additional relevant symptom or impairment. A related topic is decision trees which are by far the most well understood statistical techniques by physicians. Most of the time, they described a nested series of boolean assertions Do you have a sore throat? If yes, do you have a temperature? etc. but look at an example of public influenza diagnostic tree according to which we can form a decision regarding patients proximity i.e. how similar patients are wrt. 
attributes considered for building the tree the closer they are the more likely they are to end up in the same leaf. Association rules and the C. algorithm rely quite on the same idea. On a related topic, theres the patient ruleinduction method PRIM. Now clearly, we must make a distinction between all those methods, that make an efficient use of a large body of data and incorporate bagging or boosting to compensate for model fragility or overfitting issues, and doctors who cannot process huge amount of data in an automatic and algorithmic manner. But, for small to moderate amount of descriptors, I think they perform quite good after all.The yesorno approach is not the panacea, though. In behavioral genetics and psychiatry, it is commonly argued that the classification approach is probably not the best way to go, and that common diseases learning disorders, depression, personality disorders, etc. reflect a continuum rather than classes of opposite valence. Nobodys perfect!In conclusion, I think doctors actually hold a kind of internalized inference engine that allows them to assign patients into distinctive classes that are characterized by a weighted combination of available evidences in other words, they are able to organize their knowledge in an efficient manner, and these internal representations and the relationships they share may be augmented throughout experience. Casebased reasoning probably comes into play at some point too. All of this may be subjected to a revision with newly available data we are not simply acting as definitive binary classifiers, and are able to incorporate new data in our decision making, and b subjective biases arising from past experience or wrong selfmade association rules. 
However, they are prone to errors, as every decision systems...All statistical techniques reflecting these steps decisions trees, baggingboosting, cluster analysis, latent cluster analysis seems relevant to your questions, although they may be hard to instantiate in a single decision rule. Here are a couple of references that might be helpful, as a first start to how doctors make their decisionsA clinical decision support system for clinicians for determining appropriate radiologic imaging examinationGrzymalaBusse, JW. Selected Algorithms of Machine Learning from Examples. Fundamenta Informaticae , Santiago Medina, L, Kuntz, KM, and Pomeroy, S. Children With Headache Suspected of Having a Brain Tumor A CostEffectiveness Analysis of Diagnostic Strategies. Pediatrics , Building Better Algorithms for the Diagnosis of Nontraumatic HeadacheJenkins, J, Shields, M, Patterson, C, and Kee, F. Decision making in asthma exacerbation a clinical judgement analysis. Arch Dis Child , Croskerry, P. Achieving quality in clinical decision making cognitive strategies and detection of bias. Acad Emerg Med , .Cahan, A, Gilon, D, Manor, O, and Paltiel. Probabilistic reasoning and clinical decisionmaking do doctors overestimate diagnostic probabilities? QJM , Wegwarth, O, Gaissmaier, W, and Gigerenzer, G. Smart strategies for doctors and doctorsintraining heuristics in medicine. Medical Education ,",post


In [0]:
pd.options.display.max_colwidth = 50

In [0]:
df = df[~df.index.isin(long_text.index)]

In [39]:
df.shape

(784055, 5)

In [40]:
df.iloc[0], df.iloc[45457], df.iloc[-1]

(post_id                                                       3
 parent_id                                                   NaN
 comment_id                                                  NaN
 text          What are some valuable Statistical Analysis op...
 category                                                  title
 Name: 2, dtype: object,
 post_id                                                  288819
 parent_id                                                   NaN
 comment_id                                                  NaN
 text          Using categorical variable in Conditional Logi...
 category                                                  title
 Name: 49662, dtype: object,
 post_id                                                  279999
 parent_id                                                   NaN
 comment_id                                               542550
 text          As per your other question, your data does not...
 category                           

In [41]:
new_length = df['text'].apply(lambda x: len(x))
min(new_length), max(new_length)

(30, 9998)
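The two filtering steps can also be written as a single mask on the character count. The sketch below assumes the same thresholds as above (30 and 10000 characters) and uses a made-up frame; `MIN_CHARS`, `MAX_CHARS`, `toy` and `kept` are names introduced only for this example.

```python
import pandas as pd

MIN_CHARS, MAX_CHARS = 30, 10000  # thresholds used in this notebook

# Hypothetical texts: one too short, two on the boundaries, one too long.
toy = pd.DataFrame({'text': ['too short', 'x' * 30, 'y' * 10000, 'z' * 10001]})

lengths = toy['text'].str.len()

# between() is inclusive on both ends by default, which matches
# removing lengths < 30 and lengths > 10000.
kept = toy[lengths.between(MIN_CHARS, MAX_CHARS)]

print(kept['text'].str.len().tolist())
```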

## 5. Use a tokenizer
 to create a version of the original text that is a string of space-separated lowercase tokens. 


In [0]:
from nltk.tokenize import WordPunctTokenizer

# Instantiate the tokenizer (splits on whitespace and punctuation)
tk = WordPunctTokenizer()

# Note: this stores a list of tokens per row and keeps the original casing
df['text'] = df['text'].apply(tk.tokenize)


In [43]:
print(df.iloc[0])
print('---')
print(df.iloc[456785])

post_id                                                       3
parent_id                                                   NaN
comment_id                                                  NaN
text          [What, are, some, valuable, Statistical, Analy...
category                                                  title
Name: 2, dtype: object
---
post_id                                                  362481
parent_id                                                   NaN
comment_id                                               680911
text          [This, is, possibly, better, asked, on, crossv...
category                                                comment
Name: 472449, dtype: object
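The printed rows show that each text is stored as a list of tokens with its original casing, whereas the step description asks for a string of space-separated lowercase tokens. A minimal sketch of that variant follows; the helper name `to_token_string` is an assumption, and the sample sentence is one of the titles shown earlier.

```python
from nltk.tokenize import WordPunctTokenizer

tk = WordPunctTokenizer()


def to_token_string(text):
    """Lowercase the text, tokenize it, and join the tokens with single spaces."""
    return ' '.join(tk.tokenize(text.lower()))


print(to_token_string('What is a standard deviation?'))
```

On the full dataframe this would be applied to the cleaned strings in place of the list-producing call, for example `df['text'].apply(to_token_string)`.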


## 6. Export the resulting dataframe into a csv file

In [0]:
df.to_csv('tokenized_text.csv', index = True)
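One thing to keep in mind when reloading the file: CSV has no list type, so a column of token lists comes back as the string representation of each list. A space-separated token string survives the round trip and can be split again. The sketch below illustrates this with an in-memory buffer; the frames and token values are made up for the example.

```python
import io

import pandas as pd

tokens = ['what', 'is', 'normality', '?']


def round_trip(frame):
    """Write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


# A list column is read back as a plain string such as "['what', 'is', ...]".
as_list = round_trip(pd.DataFrame({'text': [tokens]}))
print(type(as_list.loc[0, 'text']))

# A space-separated string is read back unchanged and can be split into tokens.
as_string = round_trip(pd.DataFrame({'text': [' '.join(tokens)]}))
print(as_string.loc[0, 'text'].split())
```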