# Tika Instructions
- DONE - first, run `git clone http://github.com/chrismattmann/tika-similarity.git` and then `git clone http://github.com/chrismattmann/etllib.git`. You will use ETLlib for its tsvtojson script

- DONE - then `cd tika-similarity` and, inside there, run `pip install -r requirements.txt`. This takes care of installing Tika Similarity's dependencies.

- DONE - then, inside of etllib, run `python setup.py install`. That installs the associated scripts such as tsvtojson

- make sure that you already have your dataset by this point, with all of your additional features added

- then run tsv2json and generate approx 95k JSON files in a directory 

- once the files are generated in there, you can run similarity.py on the directory with the 95k JSON files

- it will generate a file called similarity-scores.txt that has all the resemblance and pairwise similarity scores for each item in your posts with the additional features

- then run cluster-json.py, which reads similarity-scores.txt and uses the associated threshold to perform hierarchical agglomerative clustering and generate clusters.json. Once that file is generated, you can browse the clusters

- you can mess around with different metrics, e.g., try the edit distance version of similarity.py or use value versus key based similarity and so on
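
To make the clustering step more concrete, here is a minimal sketch of threshold-based agglomerative clustering over pairwise similarity scores. It is not the code in cluster-json.py: the `(item_a, item_b, score)` tuple format, the function name `cluster_by_threshold`, and the sample file names are all assumptions for illustration. It uses single-linkage merging, so any two items whose score meets the threshold end up in the same cluster.

```python
from collections import defaultdict


def cluster_by_threshold(pair_scores, threshold):
    """Group items whose pairwise similarity is at least `threshold`.

    pair_scores: iterable of (item_a, item_b, score) tuples (hypothetical format).
    Returns a sorted list of sorted clusters (single-linkage merging via union-find).
    """
    parent = {}

    def find(x):
        # Register unseen items as their own cluster, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, score in pair_scores:
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(list)
    for item in parent:
        groups[find(item)].append(item)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical scores for four JSON files.
scores = [
    ("post1.json", "post2.json", 0.92),
    ("post2.json", "post3.json", 0.88),
    ("post3.json", "post4.json", 0.10),
]
clusters = cluster_by_threshold(scores, threshold=0.8)
print(clusters)
```

Raising the threshold produces more, smaller clusters; lowering it merges more items together.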

# Distance Metrics

- Jaccard similarity: The ratio of the number of common elements between the sets to the total number of distinct elements in both sets. The Jaccard similarity coefficient can range from 0 to 1, where a value of 0 means that the two sets have no elements in common, and a value of 1 means that the two sets are identical.
    - example: A = {1, 2, 3, 4, 5} B = {3, 4, 5, 6, 7}
    - common elements = 3; {3, 4, 5}
    - distinct elements = 7; {1, 2, 3, 4, 5, 6, 7}
    - The Jaccard similarity coefficient --> J(A, B) = |A ∩ B| / |A ∪ B| = 3 / 7 ≈ 0.43

    

- edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into another.
    - For example, consider the strings "cat" and "cut". The edit distance between these two strings is 1, because we can transform "cat" into "cut" by substituting the "a" with a "u".

- cosine similarity: compares two documents or sentences represented as vectors of word frequencies. It measures the similarity between two vectors in a multi-dimensional space by taking the cosine of the angle between them.
    - In general the value ranges from -1 to 1, where -1 means the vectors point in opposite directions, 0 means they are orthogonal (i.e., perpendicular to each other), and 1 means they point in the same direction. Word-frequency vectors have no negative entries, so for text the value falls between 0 and 1.
    - ex: Sentence1: "The cat sat on the mat" ; Sentence2: "The dog chased the cat off the mat"
    - unique words: ["the", "cat", "sat", "on", "mat", "dog", "chased", "off"]
    - Vectors: Sentence1: [2, 1, 1, 1, 1, 0, 0, 0], Sentence2: [3, 1, 0, 0, 1, 1, 1, 1]
    - cosine similarity = 8 / (√8 × √14) ≈ 0.76
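
The three metrics above can be sketched in plain Python and checked against the worked examples. These are minimal reference implementations for illustration, not the code used by tika-similarity; the function names are made up here, and the cosine version assumes simple lowercase whitespace tokenization.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical (a convention chosen for this sketch).
    return len(a & b) / len(union) if union else 1.0


def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-frequency vectors of two texts."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(jaccard_similarity({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}))
print(edit_distance("cat", "cut"))
print(cosine_similarity("The cat sat on the mat",
                        "The dog chased the cat off the mat"))
```

The printed values correspond to the Jaccard, edit distance, and cosine examples in the list above.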




In [1]:
import csv
import subprocess
import sys
import re

In [2]:
#Create TSV File
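# newline='' lets the csv module control line endings (avoids blank rows on some platforms)
with open('../Master Datasets/final_dataset.csv', 'r', newline='') as csv_file, open('../Master Datasets/Master_Dataset.tsv', 'w', newline='') as tsv_file: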
    csv_reader = csv.reader(csv_file)
    tsv_writer = csv.writer(tsv_file, delimiter='\t')
    for row in csv_reader:
        tsv_writer.writerow(row)


In [5]:
# create headers text file from the tsv

# Set the path to your TSV file and column headers file
tsv_file_path = "../Master Datasets/Master_Dataset.tsv"
column_headers_file_path = "column_headers.txt"

# Open the TSV file and read the column headers
with open(tsv_file_path, "r", encoding="utf-8") as tsv_file:
    reader = csv.reader(tsv_file, delimiter="\t")
    headers = next(reader)
    
# Write the column headers to the text file
with open(column_headers_file_path, "w", encoding="utf-8") as headers_file:
    for header in headers:
        # Headers are written unchanged, one per line; any trailing ":" (optional field)
        # or "*" (ID field) marker already present in the TSV header is kept as-is.
        headers_file.write(header + "\n")


# Change "Date" to "Dte" in the headers file


# Open the file for reading
with open('column_headers.txt', 'r') as file:
    content = file.read()

# Replace the "Account Created Date" header with "Dte"
content = re.sub(r'\bAccount Created Date\b', 'Dte', content)

# Replace any remaining standalone word "Date" with "Dte"
content = re.sub(r'\bDate\b', 'Dte', content)

# Open the file for writing and overwrite the original content
with open('column_headers.txt', 'w') as file:
    file.write(content)



In [None]:
#command to run tsvtojson


python etllib/etl/tsvtojson.py -t Master_Dataset.tsv -j test3json.json -c column_headers.txt -o json_object -s 0.5

#repackage json

python etllib/etl/repackage.py -j test3json.json -o json_object -v

#move to different folder called "json_folder"

find . -name "*.json" -print0 | xargs -0 -I {} mv {} ./json_folder/

#running similarity.py

python tika-similarity/similarity.py -f 95kJSON
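
If you would rather launch these scripts from a Python cell than from a terminal, one option is `subprocess`. The sketch below only reuses the script paths and flags shown in the commands above; the helper names `build_pipeline_commands` and `run_pipeline` are made up for illustration, and the step that moves the JSON files into a folder is left out.

```python
import subprocess


def build_pipeline_commands(tsv="Master_Dataset.tsv", json_out="test3json.json",
                            headers="column_headers.txt", json_dir="95kJSON"):
    """Return the tsvtojson, repackage, and similarity commands as argument lists."""
    return [
        ["python", "etllib/etl/tsvtojson.py", "-t", tsv, "-j", json_out,
         "-c", headers, "-o", "json_object", "-s", "0.5"],
        ["python", "etllib/etl/repackage.py", "-j", json_out,
         "-o", "json_object", "-v"],
        ["python", "tika-similarity/similarity.py", "-f", json_dir],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline_commands()
for cmd in commands:
    print(" ".join(cmd))  # preview only; call run_pipeline(commands) to execute
```

Passing argument lists instead of one shell string avoids quoting problems with paths that contain spaces.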


In [1]:
# Split 95kJSON into chunks of 100 files

import os
import shutil

# Set the path to the directory containing the JSON files
dir_path = "95kJSON"

# List the JSON files once, sorted so the chunk assignment is reproducible
json_files = sorted(f for f in os.listdir(dir_path) if f.endswith(".json"))

# Move each file into subdir_1, subdir_2, ... with at most 100 files each.
# Subdirectories are created only when needed, so no empty one is left at the end.
for index, file_name in enumerate(json_files):
    sub_dir = os.path.join(dir_path, "subdir_{}".format(index // 100 + 1))
    os.makedirs(sub_dir, exist_ok=True)
    shutil.move(os.path.join(dir_path, file_name), sub_dir)


In [None]:
#jaccard similarity

python tika-similarity/jaccard_similarity.py --inputDir 95kJSON/subdir_1 --outCSV jaccard1.csv

# Running this under Python 2.7 fails with:
#     raise RuntimeError("Unable to start Tika server.")
#     RuntimeError: Unable to start Tika server.
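
The Tika server that tika-python starts is a Java program, so a missing or broken Java runtime is one possible cause of this error. It is not confirmed to be the cause here. The sketch below, which assumes Python 3 and uses a made-up helper name `java_available`, checks whether a working `java` executable is on the PATH.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on PATH and `java -version` succeeds."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        subprocess.run([java, "-version"], check=True, timeout=10,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (OSError, subprocess.SubprocessError):
        # Covers a non-zero exit code, a timeout, or a file that cannot be executed.
        return False
    return True


print("Java found:", java_available())
```

If this prints `False`, installing a Java runtime and rerunning the command is a reasonable first thing to try.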
