# Data Scientist Assignment
>Salesken

### Problem Statement - 1: Matching the misspelt cities.

There are two data files (.csv):
<ol>
<li>Correct_cities.csv : A list of correctly spelt city names. It has three columns: "name" (the city name), "country" (the country the city belongs to) and "id" (a unique id for each city).</li>

<li>Misspelt_cities.csv : A list of misspelt city names. Each name has been misspelt by replacing 20% of its characters with random characters. It has two columns: "misspelt_name" and "country".</li>
</ol>

__Question:__  Write an algorithm or a group of algorithms to match the misspelt city name with the correct city name and return the list of unique ids for each misspelt city name.

__Solution Approach:__<br>
- To find the best match among the correct city names for every misspelt name, we compute the Hamming distance between misspelt and correct city names. Because characters were replaced rather than inserted or deleted, a misspelt name has the same length as its correct name, which is the case Hamming distance is defined for.<br>
- Since the misspelt data also has a country column, we do not need to compute the distance for every combination of correct and misspelt city names. We join both datasets on country and, within each country, keep the pair with the minimum Hamming distance for every correct city.
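
The idea above can be illustrated with a minimal pure-Python sketch. The helper names `hamming_distance` and `closest_city` are hypothetical and only for illustration; the notebook itself uses `textdistance.hamming`. The city names are taken from the sample rows shown later in this notebook.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def closest_city(misspelt, candidates):
    """Return the same-length candidate with the smallest Hamming distance."""
    same_length = [c for c in candidates if len(c) == len(misspelt)]
    # min() raises ValueError if no candidate has the same length
    return min(same_length, key=lambda c: hamming_distance(misspelt, c))


# Candidate names from one country only, mirroring the join on country
uae_cities = ["Umm al Qaywayn", "Ras al-Khaimah", "Dubai"]

print(hamming_distance("Umm al oaywaan", "Umm al Qaywayn"))  # 2
print(closest_city("Ras al-Khaamdh", uae_cities))            # Ras al-Khaimah
```

Restricting the candidates to names of the same length is what makes the later length filter in this notebook safe: a replaced character never changes the length of the name.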

In [1]:
!pip install textdistance



In [1]:
#importing required packages
import numpy as np
import pandas as pd
from textdistance import hamming

In [2]:
#Reading the dataset
correct_cities = pd.read_csv('https://raw.githubusercontent.com/SuvroBaner/SaleskenProblemSolving/master/Correct_cities.csv')
misspelt_cities = pd.read_csv('https://raw.githubusercontent.com/SuvroBaner/SaleskenProblemSolving/master/Misspelt_cities.csv')

In [3]:
print(correct_cities.head(),'\n')
print(misspelt_cities.head())

               name               country       id
0      les Escaldes               Andorra  3040051
1  Andorra la Vella               Andorra  3041563
2    Umm al Qaywayn  United Arab Emirates   290594
3    Ras al-Khaimah  United Arab Emirates   291074
4      Khawr Fakkān  United Arab Emirates   291696 

    misspelt_name        country
0  Hfjdúszoposzló        Hungary
1        Otrajnyy         Russia
2      ian Isidre           Peru
3   Bordj Zemoufa        Algeria
4     ChulamViwta  United States


In [4]:
#Checking nulls in dataset
print(correct_cities.isnull().sum(),'\n')
print(misspelt_cities.isnull().sum())

name       0
country    0
id         0
dtype: int64 

misspelt_name    0
country          0
dtype: int64


In [5]:
#Checking length of datasets
print(len(correct_cities),'\n')
print(len(misspelt_cities))

23018 

23018


In [6]:
#Checking number of unique countries in both datasets and finding number of common countries
print(len(correct_cities['country'].unique()),'\n')
print(len(misspelt_cities['country'].unique()),'\n')
print(len(correct_cities[['country']].drop_duplicates().merge(misspelt_cities[['country']].drop_duplicates(), on = 'country', how = 'inner')))

244 

244 

244


Both datasets contain 244 countries, and all of them are common to both.

In [7]:
#Checking number of cities in each country
##If there are many countries with only 1 city, we can separate them out and calculate distance for the remaining countries only
country_size = pd.DataFrame(correct_cities.groupby(['country']).size()).reset_index()
country_size.columns = ['country','size']
print('Countries with only one city: ',len(country_size.loc[country_size['size'] == 1]),'\n')
print('Countries with 5 or more cities: ',len(country_size.loc[country_size['size'] >= 5]))

Countries with only one city:  54 

Countries with 5 or more cities:  164


In [8]:
#Merging both datasets on country to get every correct-misspelt_city name combination at country level
base_df = correct_cities.merge(misspelt_cities, on = 'country', how = 'inner')
base_df.head()

Unnamed: 0,name,country,id,misspelt_name
0,les Escaldes,Andorra,3040051,les vsualdes
1,les Escaldes,Andorra,3040051,Andopma ll Vella
2,Andorra la Vella,Andorra,3041563,les vsualdes
3,Andorra la Vella,Andorra,3041563,Andopma ll Vella
4,Umm al Qaywayn,United Arab Emirates,290594,Dibka fl-Hisn


In [9]:
#Filtering out only those rows where length of string name and misspelt_name is equal
base_df2 = base_df.loc[base_df['name'].str.len() == base_df['misspelt_name'].str.len()]
print('% decrease in number of rows = ', round((len(base_df) - len(base_df2))*100/len(base_df),2),'%\n')
print('New number of rows:', len(base_df2))

% decrease in number of rows =  89.53 %

New number of rows: 2246300


In [10]:
#Resetting index for dataset
base_df2 = base_df2.reset_index(drop=True)
base_df2.head()

Unnamed: 0,name,country,id,misspelt_name
0,les Escaldes,Andorra,3040051,les vsualdes
1,Andorra la Vella,Andorra,3041563,Andopma ll Vella
2,Umm al Qaywayn,United Arab Emirates,290594,Umm al oaywaan
3,Umm al Qaywayn,United Arab Emirates,290594,Ras al-Khaamdh
4,Ras al-Khaimah,United Arab Emirates,291074,Umm al oaywaan


In [15]:
#Separating out cities whose correct name can be identified just by comparing the lengths of name and misspelt_name

#Counting the candidate misspelt names left for each correct city (id) in base_df2
city_size_map = pd.DataFrame(base_df2.groupby(['id']).size()).reset_index()
city_size_map.columns = ['id','size']

#Keeping cities with only one record in base_df2; these are already matched
city_size_map = city_size_map[city_size_map['size'] == 1]

#Selecting the rows of base_df2 for those cities
mapped_cities1 = base_df2.loc[base_df2['id'].isin(city_size_map['id'])]

#Remaining unmapped dataset
base_df3 = base_df2[~base_df2['id'].isin(mapped_cities1['id'])].reset_index(drop=True)
base_df3

Unnamed: 0,name,country,id,misspelt_name
0,Umm al Qaywayn,United Arab Emirates,290594,Umm al oaywaan
1,Umm al Qaywayn,United Arab Emirates,290594,Ras al-Khaamdh
2,Ras al-Khaimah,United Arab Emirates,291074,Umm al oaywaan
3,Ras al-Khaimah,United Arab Emirates,291074,Ras al-Khaamdh
4,Dubai,United Arab Emirates,292223,Dubai
...,...,...,...,...
2245646,Beitbridge,Zimbabwe,895269,Beitbritje
2245647,Beitbridge,Zimbabwe,895269,Ztishavaue
2245648,Epworth,Zimbabwe,1085510,Binduoa
2245649,Epworth,Zimbabwe,1085510,Esworth


In [16]:
#Making a copy of base_df3 (plain assignment would only create a second name for the same DataFrame)
base_df4 = base_df3.copy()

In [53]:
#Calculating hamming distance between all rows of name and misspelt_name 
base_df4['distance'] = 0
base_df4.loc[:,'distance'] = base_df4.loc[:, ["name","misspelt_name"]].apply(lambda x: hamming(*x), axis=1)

In [54]:
base_df4

Unnamed: 0,name,country,id,misspelt_name,distance
0,Umm al Qaywayn,United Arab Emirates,290594,Umm al oaywaan,2
1,Umm al Qaywayn,United Arab Emirates,290594,Ras al-Khaamdh,11
2,Ras al-Khaimah,United Arab Emirates,291074,Umm al oaywaan,10
3,Ras al-Khaimah,United Arab Emirates,291074,Ras al-Khaamdh,2
4,Dubai,United Arab Emirates,292223,Dubai,0
...,...,...,...,...,...
2245646,Beitbridge,Zimbabwe,895269,Beitbritje,2
2245647,Beitbridge,Zimbabwe,895269,Ztishavaue,8
2245648,Epworth,Zimbabwe,1085510,Binduoa,7
2245649,Epworth,Zimbabwe,1085510,Esworth,1


In [63]:
#Creating mapped_cities dataset based on lowest value of hamming distance
mapped_cities2 = base_df4.loc[base_df4.groupby(['name','id'])['distance'].idxmin(),:'misspelt_name']

In [70]:
#Concatenating mapped_cities1 and mapped_cities2 (DataFrame.append was removed in pandas 2.0)
mapped_cities = pd.concat([mapped_cities1, mapped_cities2]).reset_index()
mapped_cities = mapped_cities[['country','id','name','misspelt_name']]
mapped_cities

Unnamed: 0,country,id,name,misspelt_name
0,Andorra,3040051,les Escaldes,les vsualdes
1,Andorra,3041563,Andorra la Vella,Andopma ll Vella
2,United Arab Emirates,291696,Khawr Fakkān,Khapr xakkān
3,United Arab Emirates,292231,Dibba Al-Fujairah,tibba wl-Fujairab
4,United Arab Emirates,292239,Dibba Al-Hisn,Dibka fl-Hisn
...,...,...,...,...
23013,Algeria,2508184,’Aïn el Bell,’Afn ez Bell
23014,Algeria,2508180,’Aïn el Berd,’Aïn eluherd
23015,Algeria,2508152,’Aïn el Hammam,’Aïnoel nammam
23016,Algeria,2508130,’Aïn el Melh,’Aïn el Mclg


In [71]:
#Saving the final output to a csv file
mapped_cities.to_csv('mapped_cities.csv', index = False)
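
The steps of this solution can be condensed into a small end-to-end sketch on toy data. This is an illustrative variant, not the notebook's exact code: it assumes pandas is installed, uses a hypothetical pure-Python `hamming` helper instead of `textdistance`, and picks the best match per misspelt name (the notebook groups by correct city instead). The rows are copied from the outputs shown above.

```python
import pandas as pd


def hamming(a, b):
    """Number of differing positions; only called on equal-length strings here."""
    return sum(x != y for x, y in zip(a, b))


# Toy stand-ins for Correct_cities.csv and Misspelt_cities.csv
correct = pd.DataFrame({
    "name": ["Umm al Qaywayn", "Ras al-Khaimah", "Epworth", "Beitbridge"],
    "country": ["United Arab Emirates", "United Arab Emirates", "Zimbabwe", "Zimbabwe"],
    "id": [290594, 291074, 1085510, 895269],
})
misspelt = pd.DataFrame({
    "misspelt_name": ["Umm al oaywaan", "Ras al-Khaamdh", "Esworth", "Beitbritje"],
    "country": ["United Arab Emirates", "United Arab Emirates", "Zimbabwe", "Zimbabwe"],
})

# 1. Pair every misspelt name with every correct name of the same country
pairs = correct.merge(misspelt, on="country", how="inner")

# 2. Keep only pairs of equal length (replacement never changes the length)
pairs = pairs[pairs["name"].str.len() == pairs["misspelt_name"].str.len()].copy()

# 3. Hamming distance for each remaining pair
pairs["distance"] = [hamming(a, b) for a, b in zip(pairs["name"], pairs["misspelt_name"])]

# 4. For each misspelt name, keep the row with the smallest distance
best = pairs.loc[pairs.groupby("misspelt_name")["distance"].idxmin(),
                 ["misspelt_name", "country", "id", "name", "distance"]]
print(best.to_string(index=False))
```

Grouping by `misspelt_name` returns one id per misspelt city, which is how the question is phrased; ties between equally distant candidates are broken by row order in this sketch.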

---
---
### Problem Statement - 2: Find the Semantic Similarity

<ul>
<li>Part - 1: Given a list of sentences (list_of_setences.txt), write an algorithm which computes the semantic similarity and returns the similar sentences together.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content.
    </li>
>For example: "Football is played in Brazil" and "Cricket is played in India". Both sentences are about sports, so they are semantically similar.
<li>
Part - 2: Extend the above algorithm in the form of a REST API. The input parameter is a list of sentences (refer to the file list_of_setences.txt) and the response is a list of lists containing the similar sentences.
    </li>
</ul>

>For example: given an input list of 4 sentences - ["Football is played in Brazil" , "Cricket is played in India", "Traveling is good for health", "People love traveling in winter"]<br>
Output: [["Football is played in Brazil" , "Cricket is played in India"], ["Traveling is good for health", "People love traveling in winter"]]
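
The request and response shape described in Part 2 could look like the following sketch, which uses only the Python standard library. Everything here is an assumption made for illustration: the JSON keys `sentences` and `groups`, the names `group_sentences`, `handle_payload`, `SimilarityHandler` and `make_server`, and above all the grouping rule, which is a placeholder (sentences sharing any non-stop word are grouped) rather than a real semantic-similarity model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STOPLIST = set("for a of the and to in is are you there".split())


def group_sentences(sentences):
    """Placeholder grouping: sentences that share a non-stop word end up together."""
    groups = []  # list of (token_set, member_sentences)
    for sentence in sentences:
        tokens = {t for t in sentence.lower().split() if t not in STOPLIST}
        merged_tokens, merged_members = set(tokens), [sentence]
        remaining = []
        for group_tokens, members in groups:
            if group_tokens & tokens:
                # Overlapping group: fold it into the group of the current sentence
                merged_tokens |= group_tokens
                merged_members = members + merged_members
            else:
                remaining.append((group_tokens, members))
        groups = remaining + [(merged_tokens, merged_members)]
    return [members for _, members in groups]


def handle_payload(body):
    """Turn a JSON request body (bytes) into a JSON response body (bytes)."""
    sentences = json.loads(body)["sentences"]
    return json.dumps({"groups": group_sentences(sentences)}).encode("utf-8")


class SimilarityHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def make_server(port=8000):
    """Build (but do not start) a local HTTP server exposing the POST endpoint."""
    return HTTPServer(("127.0.0.1", port), SimilarityHandler)
```

Calling `make_server().serve_forever()` would start the server locally. Keeping `handle_payload` separate from the HTTP handler means the grouping function can later be swapped for a TF-IDF or embedding-based one without touching the API layer.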

In [81]:
!pip install gensim


In [82]:
#Loading required packages
from collections import defaultdict
from gensim import corpora
import pprint
import nltk 
import string 
import re
import copy

In [155]:
#Reading the dataset
sentences = pd.read_csv('https://raw.githubusercontent.com/SuvroBaner/SaleskenProblemSolving/master/list_of_sentences', 
                               header = None)
sentences.columns = ['sentences']

In [156]:
sentences

Unnamed: 0,sentences
0,good morning
1,how are you doing ?
2,the weather is awesome today
3,samsung
4,good afternoon
5,baseball is played in the USA
6,there is a thunderstorm
7,are you doing good ?
8,"The polar regions are melting"""
9,apple


In [158]:
#Convert dataframe into list
list_of_sentences = sentences['sentences'].values.tolist()

In [159]:
sentences2 = copy.deepcopy(sentences)

Preprocessing:

- Converting all words to lowercase
- Removing punctuation
- Removing stop words (common English words)
- Tokenizing - breaking sentences into words
- Stemming - reducing each word to its root form
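
The first four steps above can be sketched as one small function using only the standard library. The name `preprocess` is hypothetical, the stop list is the manually chosen one used later in this notebook, and stemming is left out because it relies on nltk.

```python
import string

# Manually chosen stop words, as used later in this notebook
STOPLIST = set("for a of the and to in is are you there".split())


def preprocess(sentence):
    """Lowercase, strip punctuation, split into words and drop stop words."""
    lowered = sentence.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punctuation.split() if token not in STOPLIST]


print(preprocess("how are you doing ?"))             # ['how', 'doing']
print(preprocess('The polar regions are melting"'))  # ['polar', 'regions', 'melting']
```

Splitting on whitespace is enough here because punctuation has already been removed; a tokenizer such as nltk's would matter more for text with contractions or hyphenated words.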

In [160]:
#converting text to lower_case
for x in range(len(list_of_sentences)):
    list_of_sentences[x] =  list_of_sentences[x].lower()

list_of_sentences

array(['good morning', 'how are you doing ?',
       'the weather is awesome today', 'samsung', 'good afternoon',
       'baseball is played in the usa', 'there is a thunderstorm ',
       'are you doing good ?', 'the polar regions are melting"', 'apple',
       'nokia', 'cricket is a fun game',
       'the climate change is a problem'], dtype=object)

In [161]:
#Removing punctuation 
def remove_punctuation(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator)

for x in range(len(list_of_sentences)):
    list_of_sentences[x] =  remove_punctuation(list_of_sentences[x])

list_of_sentences

array(['good morning', 'how are you doing ',
       'the weather is awesome today', 'samsung', 'good afternoon',
       'baseball is played in the usa', 'there is a thunderstorm ',
       'are you doing good ', 'the polar regions are melting', 'apple',
       'nokia', 'cricket is a fun game',
       'the climate change is a problem'], dtype=object)

In [162]:
#creating a copy of list_of_sentences
list_of_sentences_copy = copy.deepcopy(list_of_sentences)

In [163]:
def remove_stopwords(text): 
    stop_words = set(nltk.corpus.stopwords.words("english")) 
    word_tokens = nltk.tokenize.word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
for x in range(len(list_of_sentences_copy)):
    list_of_sentences_copy[x] =  remove_stopwords(list_of_sentences_copy[x])

list_of_sentences_copy

array([list(['good', 'morning']), list([]),
       list(['weather', 'awesome', 'today']), list(['samsung']),
       list(['good', 'afternoon']), list(['baseball', 'played', 'usa']),
       list(['thunderstorm']), list(['good']),
       list(['polar', 'regions', 'melting']), list(['apple']),
       list(['nokia']), list(['cricket', 'fun', 'game']),
       list(['climate', 'change', 'problem'])], dtype=object)

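Removing stop words with the nltk list removes every word from sentence 2 ('how are you doing') and leaves only 'good' in sentence 8 ('are you doing good'), which takes its meaning away.
Hence, we will not use the stop words from that library and will instead remove a small, manually chosen set of stop words.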

In [204]:
# Create a set of frequent words
stoplist = set('for a of the and to in is are you there'.split(' '))

texts = [[word for word in document.split() if word not in stoplist]
         for document in list_of_sentences]
texts

[['good', 'morning'],
 ['how', 'doing'],
 ['weather', 'awesome', 'today'],
 ['samsung'],
 ['good', 'afternoon'],
 ['baseball', 'played', 'usa'],
 ['thunderstorm'],
 ['doing', 'good'],
 ['polar', 'regions', 'melting'],
 ['apple'],
 ['nokia'],
 ['cricket', 'fun', 'game'],
 ['climate', 'change', 'problem']]

In [240]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem(tokens):
    #Reducing every token of a sentence to its stem
    return [stemmer.stem(token) for token in tokens]

texts2 = [stem(tokens) for tokens in texts]

In [241]:
texts2

[['good', 'morn'],
 ['how', 'do'],
 ['weather', 'awesom', 'today'],
 ['samsung'],
 ['good', 'afternoon'],
 ['basebal', 'play', 'usa'],
 ['thunderstorm'],
 ['do', 'good'],
 ['polar', 'region', 'melt'],
 ['appl'],
 ['nokia'],
 ['cricket', 'fun', 'game'],
 ['climat', 'chang', 'problem']]

In [245]:
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
#The sentences are already tokenized, so the tokenizer returns each token list unchanged
vectorizer=TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)

In [247]:
tfidf=vectorizer.fit_transform(texts2)

In [267]:
#Cosine Similarity
from sklearn.metrics.pairwise import linear_kernel

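# Cosine similarities of the first sentence with all sentences.
# TfidfVectorizer L2-normalises each row by default, so the dot product (linear kernel) equals the cosine similarity.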
cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()

cosine_similarities

array([1.        , 0.        , 0.        , 0.        , 0.36899732,
       0.        , 0.        , 0.40302781, 0.        , 0.        ,
       0.        , 0.        , 0.        ])

As we can see, 'good morning' has a non-zero similarity only with 'good afternoon' (0.37) and 'are you doing good ?' (0.40), the two other sentences that contain the word 'good'.

Similarly, we can compute Cosine similarity for other elements of dataset.
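
Extending this to all sentence pairs and returning similar sentences together could be sketched as below, in pure Python. This is a hedged illustration, not the notebook's method: it re-implements TF-IDF with the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalisation (the defaults documented for scikit-learn's `TfidfVectorizer`), and the function names and the threshold of 0.3 are assumptions chosen for this example.

```python
import math
from collections import Counter

# Stemmed tokens, copied from the notebook output above
docs = [
    ["good", "morn"], ["how", "do"], ["weather", "awesom", "today"],
    ["samsung"], ["good", "afternoon"], ["basebal", "play", "usa"],
    ["thunderstorm"], ["do", "good"], ["polar", "region", "melt"],
    ["appl"], ["nokia"], ["cricket", "fun", "game"],
    ["climat", "chang", "problem"],
]


def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors with smoothed idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def group_by_similarity(docs, threshold):
    """Group document indices linked by a chain of pairs with similarity >= threshold."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[4]), 3))  # 0.369, 'good morning' vs 'good afternoon'
print(round(cosine(vecs[0], vecs[7]), 3))  # 0.403, 'good morning' vs 'are you doing good ?'

groups = group_by_similarity(docs, threshold=0.3)
print(groups[0])  # [0, 1, 4, 7]
```

Because groups are formed transitively, 'how are you doing ?' joins the 'good morning' group through 'are you doing good ?' even though it shares no word with 'good morning'. All other sentences stay in groups of one, which shows the limit of word-overlap features on such short texts.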

Beyond this, we could use pretrained word2vec (or doc2vec) embeddings, which capture similarity between different words and are therefore likely to produce much better results than TF-IDF on word overlap.