## Forming Clusters As Needed

#### Will Russell
October 2017

### What is this anyway?

This is a project I put together to demonstrate the unsupervised learning technique used by the **Forming Clusters As Needed** (FCAN) algorithm, applied to text mining and coded in Python.

The technique was applied to a demo corpus of 28 sentences, with a Term Document Matrix (TDM) used to prepare the corpus for analysis.

### Where's the matrix? I want to try it out for myself.

The TDM may be accessed via the code contained in **`FCAN_code.ipynb`**, as a .csv in regular or transposed form (**tdm.csv** and **transpose_tdm.csv**), or as an Excel workbook (**tdm.xlsb**). All are contained within this directory.

### Description of each of the text mining steps and their purpose.

- A.) **Tokenization of sentences**
    - Tokenization is needed to establish the base words of each sentence and provide a basis for comparison between sentences. Each sentence can be viewed as an array of words, and to identify overlap between two sentences we need to establish which items they share, hence the tokenization. This step can be modified if we want to keep more of the semantic relations between words in a sentence. Instead of selecting only unigrams, we may choose other n-grams and run further analyses on these if we seek to identify certain phrases. Another way to retain certain morphologies would be to first employ named entity recognition to pick out known nouns within the set of documents, replacing known word sets such as 'the new york times' with a tag such as 'newspaper_organization' in order to retain more knowledge.
- B.) **Remove punctuation and special characters**
    - Removal of punctuation and special characters is a way of reducing noise within the document set, so that cluster definitions are based on the terms themselves rather than on genitive and grammatical markers. In removing punctuation, we lose potentially valuable information about relations between objects, as well as contextual clues that may carry important semantic information. For example, the interpreted meaning of the statement 'Let's eat, grandpa' changes significantly when the comma is removed.
- C.) **Remove numbers**
    - For this project we remove numbers in order to reduce the overall sparsity of the TDM and, as in the previous steps, to base relations on the textual information within the sentences rather than on quantitative information. This further restricts variability between sentences and allows for more effective clustering based on certain terms. The cost is that potentially valuable quantitative information, which might have strengthened the relation between two sentences, is lost. As with the first step, a technique such as named entity recognition may allow quantities to be substituted rather than eliminated outright, improving the grouping of sentences which mention distances, time, or speed with one another.
- D.) **Convert upper-case to lower-case**
    - Converting all tokens to lower-case further improves grouping within our dataset: it mitigates the effect of a word's position within a sentence and helps group formal titles and entity labels with their subject matter. Potentially valuable information about certain entities, along with lexical information useful for distinguishing between documents, may be lost here.
- E.) **Removal of Stop-words**
    - Stop-word removal is an effective means of limiting noise within a corpus and improving the selection of related sentences based on their content. Leaving the stop words in would tend to relate sentences by their grammatical structure rather than by the key words that are of greater value. Removing them places sentences containing 'artificial' at least once closer together than sentences which inadvertently contain the word 'in' multiple times but speak of entirely different subjects. This provides a benefit in selection, but as with all of the previous steps, potentially valuable semantic information may be lost.
- F.) **Stemming**
    - Stemming can greatly improve the selection for certain words by reducing each word to its root stem; however, as with the other techniques, it may adversely affect TDM generation. The impact can be especially significant when qualities such as tense or plurality matter, as this information is lost.
- G.) **Combining stemmed words**
    - This technique allows us to reduce the overall size and sparsity of the TDM by merging the entries of words that share a root. This is beneficial as it further shrinks the space in which we need to search for relations. The negative impacts in terms of potential information loss are similar to those discussed above in stemming.
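
Steps A through G can be sketched end to end with only the standard library. This is a minimal illustration, not the code in `FCAN_code.ipynb`: the stop-word list, the `crude_stem` suffix stripper (a toy stand-in for a real stemmer such as Porter's), and the function names `preprocess` and `build_tdm` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the notebook's list).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "at", "it", "per"}

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    tokens = sentence.split()                                  # A: tokenize on whitespace
    tokens = [re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens]  # B: drop punctuation/special chars
    tokens = [re.sub(r"[0-9]", "", t) for t in tokens]         # C: drop digits
    tokens = [t.lower() for t in tokens if t]                  # D: lower-case, skip empty tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]        # E: remove stop words
    return [crude_stem(t) for t in tokens]                     # F: stem

def build_tdm(sentences):
    """Return (sorted vocabulary, count rows), one row per sentence."""
    counts = [Counter(preprocess(s)) for s in sentences]       # G: identical stems merge here
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]

sentences = [
    "The car travels 60 miles per hour!",
    "Cars traveled 100 miles.",
]
vocab, rows = build_tdm(sentences)
print(vocab)  # ['car', 'hour', 'mile', 'travel']
for row in rows:
    print(row)  # [1, 1, 1, 1] then [1, 0, 1, 1]
```

Note how 'travels' and 'traveled' collapse into a single 'travel' entry: that is the sparsity reduction described in step G, and also where the tense information mentioned in step F disappears.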
    

#### Discussion

I realize that there are a great number of libraries out there which would have greatly simplified this whole process and ultimately might have left us with a better result in the end, but that's not really the point...the point is to learn, right?

So continuing on...

There are a number of changes which can be made to affect the overall performance of the clustering algorithm. The most straightforward methods would be to modify the learning rates and distance measures used to establish the clusters, though changes in the preprocessing would also potentially have a significant impact.
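
To make those two knobs concrete, here is a minimal sketch of a threshold-based incremental clustering loop. It is an assumption about how an FCAN-style procedure can be arranged, not the code in `FCAN_code.ipynb`: the names `form_clusters`, `threshold`, and `learning_rate`, and the default Euclidean distance, are hypothetical choices for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def form_clusters(vectors, threshold, learning_rate, distance=euclidean):
    """Assign each vector to its nearest centre, or open a new cluster
    when no existing centre lies within `threshold`."""
    centres, members = [], []
    for index, vec in enumerate(vectors):
        best = None
        if centres:
            dists = [distance(vec, c) for c in centres]
            nearest = min(range(len(centres)), key=dists.__getitem__)
            if dists[nearest] <= threshold:
                best = nearest
        if best is None:
            # Nothing close enough: form a new cluster as needed.
            centres.append([float(x) for x in vec])
            members.append([index])
        else:
            # Nudge the winning centre toward the new vector.
            centres[best] = [c + learning_rate * (x - c)
                             for c, x in zip(centres[best], vec)]
            members[best].append(index)
    return centres, members

vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 5], [0, 1, 5]]
centres, members = form_clusters(vectors, threshold=1.5, learning_rate=0.5)
print(members)  # [[0, 1], [2, 3]]
```

In a sketch like this, swapping `distance` for another measure or changing `learning_rate` alters which sentences end up together, which is the kind of tuning described above.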

By employing a technique such as Named Entity Recognition, one might improve the clustering of sentences by retaining certain qualities which were eliminated in preprocessing. For example, the relations between sentences discussing autonomous vehicle ranges could be improved by replacing quantities such as '300,000' and '443' with a universal tag 'distance_measure'.
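
As a rough illustration of the substitution idea, the sketch below uses a regular expression as a stand-in for real Named Entity Recognition. The `NUMBER` pattern, the `tag_quantities` helper, and the example sentences are assumptions; a genuine NER step would distinguish distances from times or speeds rather than tagging every number the same way.

```python
import re

# Matches numbers with thousands separators (300,000) or plain/decimal numbers (443, 3.5).
NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b")

def tag_quantities(sentence, tag="distance_measure"):
    """Replace every numeric quantity in the sentence with a single tag."""
    return NUMBER.sub(tag, sentence)

print(tag_quantities("The sedan completed 300,000 kilometers."))
# The sedan completed distance_measure kilometers.
print(tag_quantities("It achieved a range of 443 miles."))
# It achieved a range of distance_measure miles.
```

A substitution like this would have to run before the number-removal step, and the tag would need to survive punctuation removal, for it to reach the TDM.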

In addition, creating a larger term document matrix which includes bigrams and trigrams may improve the selection for phrases such as 'miles per hour' and the clustering of sentences which discuss velocity. There are certain overlaps which occur in the clusters below, such as in cluster 3, where sentences discussing autonomous vehicles overlap with those speaking of housing. In this case it seems that the information provided by an expanded TDM, using bigrams, trigrams, and named entity recognition, might have improved the performance of the clustering algorithm.
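
A minimal sketch of how the term list could be expanded with bigrams and trigrams follows. The helper names `ngrams` and `expanded_terms`, and the choice to join words with an underscore, are assumptions for illustration rather than anything taken from the notebook.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word phrases, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def expanded_terms(tokens, max_n=3):
    """Unigrams plus every n-gram up to max_n, usable as extra TDM terms."""
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms

print(expanded_terms(["miles", "per", "hour"]))
# ['miles', 'per', 'hour', 'miles_per', 'per_hour', 'miles_per_hour']
```

The trade-off is a wider and sparser matrix, so in practice one might keep only the n-grams that occur in more than one sentence.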



#### Example Output

```txt
------Printing out contents of cluster 1  : Cluster length : 7--------
1 : up autonom speed per travel hour road type mile sedan
2 : per hour kilomet car second mile
3 : test charg road rang kilomet achiev around mile
4 : obsolesc sort wai machin escap musk intellig biolog human merger
5 : up charg percent befor kilomet car mile go
6 : autonom per hour lap kilomet around mile sedan
7 : artifici sentienc lead machin necessarili intellig possibl learn

------Printing out contents of cluster 2  : Cluster length : 1--------
1 : awai artifici predict far singular machin year superintellig rai exce intellig kurzweil futur human selfimprov learn

------Printing out contents of cluster 3  : Cluster length : 15--------
1 : room size rent bedroom bath newli larg full home live eat remodel util kitchen
2 : room paint interior entir bedroom larg home live freshli
3 : washer four row bedroom dryer bath home hous basement come finish
4 : space approv owner park possibl back three pet
5 : autonom test driven road limit kilomet public car
6 : size bedroom king larg suit veri nice bed queen
7 : autonom drive complet kilomet over accidentfre mile
8 : updat hous floor new kitchen renov
9 : artifici lisp inventor languag intellig term john mccarthi program coin
10 : deal autonom number road car situat
11 : pound gallon per drive averag over mile sedan
12 : locat famili bedroom bath near home rout major conveni singl
13 : spread throughout book spiritu author machin rai intellig kurzweil describ ag cosmo
14 : cute updat applianc great hous veri classi area live open kitchen
15 : per went hour road lap minut averag around car round second mile

------Printing out contents of cluster 4  : Cluster length : 1--------
1 : base negoti applianc includ water ga respons well tenant electr secur system anim pet

------Printing out contents of cluster 5  : Cluster length : 1--------
1 : know paradigm artifici knowledg experi wai realiti sens combin via intellig two come five everyth

------Printing out contents of cluster 6  : Cluster length : 1--------
1 : reason emot feel artifici engag selfawar multipl knowledg express commonsens t gener attain intellig understand domain comput learn program

------Printing out contents of cluster 7  : Cluster length : 1--------
1 : room sewag artifici includ water trash bedroom central automat themtwo machin through lai townhous groundwork bathroom live eat fundament heat intellig understand world recent comput around air techniqu work increas kitchen learn deep
```