**CP2410 Assignment 1 by MaoJie**  
This assignment uses the data from the Traveling Santa 2018 competition at  
<https://www.kaggle.com/c/traveling-santa-2018-prime-paths>.  
Kaggle notebook:  
<https://www.kaggle.com/jiemao/kernel3352355478>

In [None]:
"""Load the file, keep the first ten percent of the rows, and look at the data."""
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt
import time
import os
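df_cities = pd.read_csv('../input/traveling-santa-2018-prime-paths/cities.csv')
# Keep only the first ten percent of the rows to limit the running time
percent_rows = int(len(df_cities)*0.1)
df_cities = df_cities.head(percent_rows)
# Last expression in the cell, so the notebook displays the first rows
df_cities.head()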

In [None]:
fig = plt.figure(figsize=(20,20))
#cmap, norm = from_levels_and_colors([0.0, 0.5, 1.5], ['red', 'black'])
plt.scatter(df_cities['X'],df_cities['Y'],marker = '.',c=(df_cities.CityId != 0).astype(int), cmap='Set1', alpha = 0.6, s = 500*(df_cities.CityId == 0).astype(int)+1)
plt.show()

**The red dot marks the North Pole (CityId = 0).  
The task is to find a path that starts at the red dot,  
visits every other city exactly once, and returns to the red  
dot, while keeping the total travel distance as small as possible.**  

In [None]:
# Source = https://www.kaggle.com/seshadrikolluri/understanding-the-problem-and-some-sample-paths
# To improve performance, instead of testing each CityId for primality separately,
# we first generate a list in which each element tells whether the number given
# by its position is a prime or not,
# using the sieve of Eratosthenes.


# Returns a list where primes[k] is True if k is a prime
def sieve_of_eratosthenes(n):
    primes = [True for i in range(n+1)] # Start assuming all numbers are primes
    primes[0] = False # 0 is not a prime
    primes[1] = False # 1 is not a prime
    for i in range(2,int(np.sqrt(n)) + 1):
        if primes[i]:
            k = 2
            while i*k <= n:
                primes[i*k] = False
                k += 1
    return(primes)

prime_cities = sieve_of_eratosthenes(max(df_cities.CityId))

**Algorithm Analysis**  

**The Algorithm (Sieve of Eratosthenes)**  
   1. Create a list with one entry per number, initially assuming that all numbers are primes.  
   2. Starting at 2, cross out all multiples of 2, not counting 2 itself.  
   3. Move up to the next number that has not been crossed out and cross out its multiples in the same way.  
   4. Repeat steps 2-3 until the current number exceeds n<sup>1/2</sup>.
    
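**Time complexity:**   
   For each value i of the outer loop, the number of inner loop iterations (while i*k <= n...k += 1) is:  
   
   $\lfloor\frac{n}{i}\rfloor - 1 < \frac{n}{i}$  
   
   Even if the inner loop ran for every i from 2 to n, the total number of iterations would be less than:  
   
   $\frac{n}{2} + \frac{n}{3} + ... + \frac{n}{n}$  
   
   This is equivalent to:  
   
   $n(\frac{1}{2} + \frac{1}{3} + ... + \frac{1}{n})$  
   
   From the asymptotic expansion of the harmonic series:  
   
   $n(\frac{1}{2} + \frac{1}{3} + ... + \frac{1}{n}) \approx n(\ln n + \gamma - 1)$, where γ is the Euler-Mascheroni constant  
   
   So the running time of the sieve is bounded by O(nlogn). Because the inner loop only runs when i is a prime no larger than n<sup>1/2</sup>, the tighter known bound is O(nloglogn).  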
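The bound above can be checked empirically. The following is a minimal sketch, not part of the original notebook: the hypothetical helper `count_sieve_steps` reruns the same sieve and counts how many times the inner loop body executes, so the count can be compared against n·ln(n) and n·ln(ln(n)).

```python
import math

def count_sieve_steps(n):
    """Run the sieve for n >= 1 and return the number of inner-loop executions."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    steps = 0
    for i in range(2, math.isqrt(n) + 1):
        if is_prime[i]:
            # Cross out 2i, 3i, ... up to n, exactly like the while loop above
            for multiple in range(2 * i, n + 1, i):
                is_prime[multiple] = False
                steps += 1
    return steps

for n in (1_000, 10_000, 100_000):
    steps = count_sieve_steps(n)
    # Print the raw count and its ratio to the two candidate growth rates
    print(n, steps,
          round(steps / (n * math.log(n)), 3),
          round(steps / (n * math.log(math.log(n))), 3))
```

If the ratio to n·ln(n) keeps shrinking as n grows while the ratio to n·ln(ln(n)) stays roughly level, that is consistent with O(nlogn) being a safe but loose upper bound.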
   

**Efficiency of Algorithm:**  
The time complexity of this algorithm is bounded by O(nlogn). For the amount of data in this assignment, the efficiency of the algorithm is fairly good, and the running time is acceptable. If the amount of data increases 10 times, the O(nlogn) bound predicts that the running time will be at most about 12 times as long as it is now.
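The "about 12 times" figure follows from the ratio (10n·ln(10n)) / (n·ln(n)). The sketch below works through that arithmetic; the value `n = 20_000` is an assumption standing in for the size of the ten percent sample, and `growth_ratio` is a hypothetical helper, not something from the notebook.

```python
import math

def growth_ratio(n, factor=10):
    """Predicted slowdown when the input grows from n to factor*n."""
    big = factor * n
    # Ratio under an n*log(n) cost model
    nlogn = (big * math.log(big)) / (n * math.log(n))
    # Ratio under an n*log(log(n)) cost model
    nloglogn = (big * math.log(math.log(big))) / (n * math.log(math.log(n)))
    return nlogn, nloglogn

n = 20_000  # assumed sample size, for illustration only
ratio_nlogn, ratio_nloglogn = growth_ratio(n)
print(f"n log n model:     x{ratio_nlogn:.1f}")
print(f"n log log n model: x{ratio_nloglogn:.1f}")
```

Under these assumptions the n·log(n) model gives a factor slightly above 12, and the n·log(log(n)) model gives a factor closer to 11.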

In [None]:
"""Dumbest Path: Go in the order of CityIDs: 0, 1, 2.. etc.
and come back to zero when you reach the end."""

start = time.time()

def total_distance(dfcity,path):
    prev_city = path[0]
    distance = 0
    step_num = 1
    for city_num in path[1:]:
        # Euclidean distance between two consecutive cities
        step = np.sqrt((dfcity.X[city_num] - dfcity.X[prev_city])**2 + (dfcity.Y[city_num] - dfcity.Y[prev_city])**2)
        # Every 10th step is 10% longer unless it starts from a prime CityId
        if step_num % 10 == 0 and not prime_cities[prev_city]:
            step = step * 1.1
        distance = distance + step
        prev_city = city_num
        step_num = step_num + 1
    return distance

# Series.append was removed in pandas 2.0, so build the path from plain lists
dumbest_path = list(df_cities.CityId) + [0]
print('Total distance with the dumbest path is '+ "{:,}".format(total_distance(df_cities,dumbest_path)))

end = time.time()

In [None]:
#Running time
print(end - start)

In [None]:
#Path graph of the Dumbest Path
df_path = pd.DataFrame({'CityId':dumbest_path}).merge(df_cities,how = 'left')
fig, ax = plt.subplots(figsize=(20,20))
ax.plot(df_path['X'], df_path['Y']) 

**As we can see, the dumbest path is far from efficient. Next, I will optimize the path with  
several algorithms to minimize the total distance.**  

**Dumbest Path: Go in the order of CityIDs: 0, 1, 2.. etc.  
and come back to zero when you reach the end.**    
To optimize the path, I used the following functions:  
1. A function that tells whether a number is a prime  
2. A function that builds a path with the Nearest Neighbour Algorithm, which is then scored by total distance  
3. A prime-swapping step

In [None]:
# Source = https://www.kaggle.com/thexyzt/xyzt-s-visualizations-and-various-tsp-solvers
# Nearest Neighbour
# Starting from the North Pole, travel to the nearest city (without concern for prime-ness of a city).

start = time.time()

def nearest_neighbour():
    cities = pd.read_csv('../input/traveling-santa-2018-prime-paths/cities.csv')
    percent_rows = int(len(cities)*0.1)
    df_cities = cities.head(percent_rows)
    cities = df_cities
    ids = cities.CityId.values[1:]
    xy = np.array([cities.X.values, cities.Y.values]).T[1:]
    path = [0,]
    while len(ids) > 0:
        last_x, last_y = cities.X[path[-1]], cities.Y[path[-1]]
        dist = ((xy - np.array([last_x, last_y]))**2).sum(-1)
        nearest_index = dist.argmin()
        path.append(ids[nearest_index])
        ids = np.delete(ids, nearest_index, axis=0)
        xy = np.delete(xy, nearest_index, axis=0)
    path.append(0)
    return path

nnpath = nearest_neighbour()
print('Total distance with the Nearest Neighbor path '+  "is {:,}".format(total_distance(df_cities,nnpath)))

end = time.time()

In [None]:
#The Nearest Neighbour Algorithm running time
print(end - start)

In [None]:
#Path graph by The Nearest Neighbour Algorithm
df_path = pd.DataFrame({'CityId':nnpath}).merge(df_cities,how = 'left')
fig, ax = plt.subplots(figsize=(20,20))
ax.plot(df_path['X'], df_path['Y']) 

Compared with the dumbest path, the Nearest Neighbour Algorithm reduced the total distance by about 43840548 units.

**Algorithm Analysis**  

**The Algorithm (Nearest Neighbour):**  
1. Store the CityIds and their corresponding coordinates in arrays. 

2. Travel from the North Pole (CityId = 0) to the nearest city.  

3. Record the CityId of the visited city and remove it from the candidates, so that it is never visited again.  

4. Travel from the current city to the nearest unvisited city.  

5. Repeat steps 3 and 4 until no cities remain, then return to the start (the North Pole).  

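**Efficiency of Algorithm:**  
The time complexity of this algorithm is O(n<sup>2</sup>): each of the n steps computes the distance to every remaining city and then deletes one entry from the arrays, and both operations take O(n) time. For the amount of data in this assignment, the running time is still acceptable. If the amount of data increases 10 times, the running time will be about 100 times as long as it is now.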
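The cost structure of the greedy search is easier to see on a small self-contained example. The sketch below is an illustrative variant, not the notebook's function: `nearest_neighbour_path` is a hypothetical name, it works on a plain coordinate array instead of the competition file, and it marks visited cities with a boolean mask rather than calling `np.delete`. The loop runs n - 1 times and each pass scans all n rows, which is where the quadratic growth comes from.

```python
import numpy as np

def nearest_neighbour_path(xy):
    """Greedy tour over the rows of xy, starting and ending at row 0."""
    n = len(xy)
    unvisited = np.ones(n, dtype=bool)
    unvisited[0] = False
    path = [0]
    for _ in range(n - 1):              # n - 1 greedy steps
        current = xy[path[-1]]
        # Squared distance from the current city to every city: O(n) work
        dist = ((xy - current) ** 2).sum(axis=1)
        dist[~unvisited] = np.inf       # never revisit a city
        nearest = int(dist.argmin())
        path.append(nearest)
        unvisited[nearest] = False
    path.append(0)                      # return to the start
    return path

# Four cities on a line; row 0 plays the role of the North Pole
toy = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(nearest_neighbour_path(toy))

# Random cities, generated only for illustration
rng = np.random.default_rng(0)
random_path = nearest_neighbour_path(rng.random((200, 2)) * 100)
print(len(random_path))
```

The mask avoids reallocating the arrays at every step, but the distance scan is still linear per step, so the overall complexity stays quadratic. A spatial index would be needed to go below that, which is outside the scope of this notebook.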

In [None]:
"""
Source = https://www.kaggle.com/seshadrikolluri/understanding-the-problem-and-some-sample-paths
Further optimize the path by swapping primes:
The rules say "every 10th step is 10% more lengthy unless coming from a prime CityId".
So for each prime city that is not already the origin of a 10th step, I try swapping
it into one of several nearby 10th-step positions and keep the swap if the local
part of the path becomes shorter.
"""
start = time.time()
nnpath_with_primes = nnpath.copy()
for index in range(20,len(nnpath_with_primes)-30):
    city = nnpath_with_primes[index]
    if (prime_cities[city] &  ((index+1) % 10 != 0)):        
        for i in range(-1,3):
            tmp_path = nnpath_with_primes.copy()
            swap_index = (int((index+1)/10) + i)*10 - 1
            tmp_path[swap_index],tmp_path[index] = tmp_path[index],tmp_path[swap_index]
            if total_distance(df_cities,tmp_path[min(swap_index,index) - 1 : max(swap_index,index) + 2]) < total_distance(df_cities,nnpath_with_primes[min(swap_index,index) - 1 : max(swap_index,index) + 2]):
                nnpath_with_primes = tmp_path.copy() 
                break
print('Total distance with the Nearest Neighbor With Prime Swaps '+  "is {:,}".format(total_distance(df_cities,nnpath_with_primes)))
end = time.time()

**The Prime Swaps Algorithm reduced the total distance by another 286 units.**
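To make the penalty rule concrete, here is a toy sketch that is independent of the competition data. The names `is_prime_table` and `path_length` are hypothetical, and the twelve cities on a line are made up for illustration. Both paths below cover the same geometry, so any difference in length comes only from whether the 10th step starts at a prime CityId.

```python
import math
import numpy as np

def is_prime_table(n):
    """Boolean array where table[k] is True if k is prime (requires n >= 1)."""
    table = np.ones(n + 1, dtype=bool)
    table[:2] = False
    for i in range(2, math.isqrt(n) + 1):
        if table[i]:
            table[i * i :: i] = False
    return table

def path_length(xy, path, prime_table):
    """Total length of path, with every 10th step 10% longer
    unless that step starts from a prime CityId."""
    path = np.asarray(path)
    diffs = xy[path[1:]] - xy[path[:-1]]
    steps = np.sqrt((diffs ** 2).sum(axis=1))
    step_numbers = np.arange(1, len(path))
    penalised = (step_numbers % 10 == 0) & ~prime_table[path[:-1]]
    return float((steps * np.where(penalised, 1.1, 1.0)).sum())

# Twelve cities on a line: city k sits at x = k
xy = np.column_stack([np.arange(12.0), np.zeros(12)])
primes = is_prime_table(11)

forward = list(range(12))         # 10th step starts at city 9 (not prime)
backward = list(range(11, -1, -1))  # 10th step starts at city 2 (prime)

length_forward = path_length(xy, forward, primes)
length_backward = path_length(xy, backward, primes)
print(length_forward, length_backward)
```

The forward path pays the 10% penalty on one unit-length step and the backward path does not, which is the effect the prime swaps try to exploit. The backward path does not start at city 0, so it is not a valid tour; it is used here only to isolate the penalty.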

In [None]:
# Algorithm run time
print(end - start)

In [None]:
#Path graph after The Prime Swaps Algorithm
df_path = pd.DataFrame({'CityId':nnpath_with_primes}).merge(df_cities,how = 'left')
fig, ax = plt.subplots(figsize=(20,20))
ax.plot(df_path['X'], df_path['Y']) 