Note: This notebook is inspired by https://www.kaggle.com/fabiendaniel/customer-segmentation

With the data prepared in the HackathonEDA notebook, we can start clustering.

Let's first load the latest data

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 50)

data = pd.read_csv("../../data/processed/tempDataFrameLandmark/ratefinalData2ndCheckpoint.csv")
data = data.drop(['Unnamed: 0'], axis = 1)
data.head()

### Creating clusters of counties with a high number of tourists per population

In this section I group the counties by how popular they are for tourism relative to their overall population. We start with good old <b>kmeans</b> clustering. Before that, let's create a new dataframe that holds only the tourism-related columns of the prepared data. We are interested in the following fields:

* lodgingInventoryBucketNonVacationalRental
* lodgingInventoryBucketVacationalRental
* lIBNonVRRatioToPopulation
* lIBVRRatioToPopulation
* aIPBRatioToPopulation
* AERAFEmployementRatio

In [None]:
tourismByCounty = data.filter(['lodgingInventoryBucketNonVacationalRental', 'lodgingInventoryBucketVacationalRental', 'lIBNonVRRatioToPopulation', 'lIBVRRatioToPopulation', 'aIPBRatioToPopulation', 'AERAFEmployementRatio'], axis=1)
tourismByCounty.head()

To estimate the number of clusters that best represents the data, I use the <b>silhouette score</b>:

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from tqdm import tqdm

# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
matrix = tourismByCounty.values
for n_clusters in tqdm(range(3,10)):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    silhouette_avg = silhouette_score(matrix, clusters)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    if(n_clusters % 10 == 0):
        unique, counts = np.unique(clusters, return_counts=True)
        print("Minimum values in cluster is: " , min(counts))
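
As a minimal sketch of what the silhouette score measures, the cell below computes the per-sample silhouette by hand, as (b - a) / max(a, b), and compares it with `silhouette_samples`. The toy array `X`, the labels, and the helper `silhouette_by_hand` are assumptions made for illustration; they are not part of the county data.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data (hypothetical, not the county data): two well-separated 1-D groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette of sample i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a: mean distance to the other members of the same cluster
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(manual)
print(np.allclose(manual, reference))
```

Values close to 1 mean a sample sits much nearer to its own cluster than to the next one; values below 0 suggest it may be assigned to the wrong cluster.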

The score is highest for 3 clusters, but three groups are too coarse to be useful. Let's group the counties into 5 clusters instead.

In [None]:
n_clusters = 5
silhouette_avg = -1
while silhouette_avg < 0.3199:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    silhouette_avg = silhouette_score(matrix, clusters)
    
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
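
The loop above refits KMeans until the score passes a hand-picked threshold, which never terminates if that threshold cannot be reached. One possible alternative, sketched here under assumptions, is to try a fixed list of seeds and keep the best labelling. The helper `best_kmeans`, the seed list, and the synthetic blobs standing in for `matrix` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X, n_clusters, seeds=range(5)):
    """Fit KMeans once per seed and keep the labelling with the highest silhouette."""
    best_score, best_labels = -np.inf, None
    for seed in seeds:
        km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=seed)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels

# Synthetic stand-in for `matrix` (assumption: three Gaussian blobs).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
score, labels = best_kmeans(X, n_clusters=3)
print(round(score, 3), np.bincount(labels))
```

Fixing `random_state` also makes the cluster labels reproducible between runs, which matters when later cells refer to a cluster by number.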

<b>Characterizing the content of clusters</b>

Number of elements in each cluster

In [None]:
import numpy as np
unique, counts = np.unique(clusters, return_counts=True)
print(np.asarray((unique, counts)).T)

<b>Silhouette intra-cluster score</b>

In order to have an insight on the quality of the classification, we can represent the silhouette scores of each element of the different clusters.

In [None]:
def graph_component_silhouette(n_clusters, lim_x, mat_size, sample_silhouette_values, clusters):
    plt.rcParams["patch.force_edgecolor"] = True
    plt.style.use('fivethirtyeight')
    mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
    #____________________________
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(10, 8)
    ax1.set_xlim([lim_x[0], lim_x[1]])
    ax1.set_ylim([0, mat_size + (n_clusters + 1) * 10])
    y_lower = 10
    for i in range(n_clusters):
        #___________________________________________________________________________________
        # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[clusters == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        cmap = plt.get_cmap("Spectral")  # cm.get_cmap was removed in recent matplotlib releases
        color = cmap(float(i) / n_clusters)        
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                           facecolor=color, edgecolor=color, alpha=0.8)
        #____________________________________________________________________
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.03, y_lower + 0.5 * size_cluster_i, str(i), color = 'red', fontweight = 'bold',
                bbox=dict(facecolor='white', edgecolor='black', boxstyle='round, pad=0.3'))
        #______________________________________
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm

%matplotlib inline

#____________________________________
# define individual silhouette scores
sample_silhouette_values = silhouette_samples(matrix, clusters)
#__________________
# and do the graph
graph_component_silhouette(n_clusters, [-0.4, 0.7], len(tourismByCounty), sample_silhouette_values, clusters)
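
The plot can be complemented with a few numbers per cluster. The sketch below reports, for each cluster, its size, its mean silhouette, and the share of samples with a negative silhouette. The helper `silhouette_summary` and the toy arrays are assumptions; in the notebook they would be replaced by `sample_silhouette_values` and `clusters`.

```python
import numpy as np

def silhouette_summary(sample_values, labels):
    """Return {cluster: (size, mean silhouette, share of negative values)}."""
    summary = {}
    for k in np.unique(labels):
        vals = sample_values[labels == k]
        summary[int(k)] = (len(vals), float(vals.mean()), float((vals < 0).mean()))
    return summary

# Hypothetical values standing in for `sample_silhouette_values` and `clusters`.
vals = np.array([0.6, 0.4, -0.2, 0.5, 0.1, -0.1])
labs = np.array([0, 0, 0, 1, 1, 1])
summary = silhouette_summary(vals, labs)
print(summary)
```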

Now we can look at what is in each cluster and rank the clusters by how many tourists they attract per population.

In [None]:
import operator
clusterScore = {}

for i in range(n_clusters):
    county_cluster = tourismByCounty.loc[clusters == i]
    clusterScore[i] = (county_cluster['lIBNonVRRatioToPopulation'].mean() + 
                           county_cluster['lIBVRRatioToPopulation'].mean() + 
                           county_cluster['aIPBRatioToPopulation'].mean() + 
                           county_cluster['AERAFEmployementRatio'].mean()) / len(county_cluster)
    
sorted_d = dict( sorted(clusterScore.items(), key=operator.itemgetter(1),reverse=True))
print('Dictionary in descending order by value : ',sorted_d)
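
The same ranking can be expressed with a `groupby`. This is a sketch that mirrors the scoring in the cell above (sum of the column means, divided by the cluster size); the helper `rank_clusters` and the toy columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_clusters(df, labels, columns):
    """Score each cluster as (sum of column means) / cluster size, highest first."""
    grouped = df[columns].groupby(labels)
    score = grouped.mean().sum(axis=1) / grouped.size()
    return score.sort_values(ascending=False)

# Hypothetical data: two feature columns and a label array of the same length.
toy = pd.DataFrame({"a": [1.0, 3.0, 10.0], "b": [2.0, 2.0, 4.0]})
toy_labels = np.array([0, 0, 1])
ranking = rank_clusters(toy, toy_labels, ["a", "b"])
print(ranking)
```

Note that dividing by the cluster size favours small clusters; whether that is desirable depends on how the score is meant to be read.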

Let's add the cluster number of each county to the main dataframe.

In [None]:
data.loc[:, 'tourismCluster'] = clusters
data.head()

Let's look at the first county in the cluster with the highest tourism per population, which is cluster 4 in this run (KMeans labels are arbitrary and can change between runs).

In [None]:
data[data['tourismCluster'] == 4].iloc[0]

This is the county: Denali Borough in Alaska. And Denali National Park is Alaska’s most popular land attraction. Source: https://www.alaska.org/destination/denali-national-park

This suggests that the clustering is sensible. Here is an image of this beautiful county.

In [None]:
from IPython.display import Image
Image("../../data/raw/Images/denali.jpg")

Adding a checkpoint

In [None]:
data.to_csv(r'../../data/processed/tempDataFrameLandmark/dataWithTourismCluster.csv') 
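
A note on the checkpoint format: by default `to_csv` writes the index as an unnamed first column, which is why the loading cell at the top has to drop `Unnamed: 0`. The sketch below shows the difference that `index=False` makes; the two-row dataframe and the temporary file name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"county": ["A", "B"], "tourismCluster": [4, 1]})  # hypothetical rows

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    df.to_csv(path)                # default: index written as an unnamed first column
    with_index = pd.read_csv(path)
    df.to_csv(path, index=False)   # index omitted from the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```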

In [None]:
data.head()

## Clustering of counties based on positive economic impact on society

In this section we cluster the counties by indicators of positive economic impact on society. We use the following fields:
* medianHouseHoldIncome
* unEmployementRate
* vacantHousingUnitsRatio
* familiesUnderPovertyScale
* totalEmployedInOwnBusinesRate

Scaling median household income

In [None]:
data['medianHouseHoldIncome'] = (data['medianHouseHoldIncome'] / max(data['medianHouseHoldIncome'])) * 100

In [None]:
economicsByCounties = data.filter(['medianHouseHoldIncome', 'unEmployementRate', 'vacantHousingUnitsRatio', 'familiesUnderPovertyScale', 'totalEmployedInOwnBusinesRate'], axis=1)
economicsByCounties.head()

<b>Data encoding</b>
The different variables I selected have quite different ranges of variation and before continuing the analysis, I create a matrix where these data are standardized

In [None]:
from sklearn.preprocessing import StandardScaler

matrix = economicsByCounties.values  # as_matrix() was removed in pandas 1.0
scaler = StandardScaler()
scaler.fit(matrix)
print('variables mean values: \n' + 90*'-' + '\n' , scaler.mean_)
scaled_matrix = scaler.transform(matrix)
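
For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below checks this against `StandardScaler` on a small array of made-up values, assuming the population standard deviation (`ddof=0`), which is what NumPy's `std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.0, 3.0], [70.0, 5.0], [90.0, 10.0]])  # hypothetical values
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

After this step every column has mean 0 and unit variance, so no single variable dominates the Euclidean distances that KMeans relies on.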

In [None]:
# Silhouette score for 3 to 9 clusters on the standardized economic features
for n_clusters in tqdm(range(3,10)):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    if(n_clusters % 10 == 0):
        unique, counts = np.unique(clusters, return_counts=True)
        print("Minimum values in cluster is: " , min(counts))

The silhouette score rises over the first few cluster counts and then starts to drop. The cells below use 5 clusters.

In [None]:
n_clusters = 5
silhouette_avg = -1
while silhouette_avg < 0.255:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

<b>Characterizing the content of clusters</b>

Number of elements in each cluster

In [None]:
unique, counts = np.unique(clusters, return_counts=True)
print(np.asarray((unique, counts)).T)

<b>Silhouette intra-cluster score</b>

In order to have an insight on the quality of the classification, we can represent the silhouette scores of each element of the different clusters.

In [None]:
#____________________________________
# define individual silhouette scores
sample_silhouette_values = silhouette_samples(scaled_matrix, clusters)
#__________________
# and do the graph
graph_component_silhouette(n_clusters, [-0.2, 0.6], len(scaled_matrix), sample_silhouette_values, clusters)

<b>Counties Morphotype</b>
I have verified that the different clusters are indeed disjoint (at least globally). It remains to understand the behaviour of the counties in each cluster. To do so, I start by adding to the economicsByCounties dataframe a variable that defines the cluster to which each county belongs:

In [None]:
economicsByCounties.loc[:, 'cluster'] = clusters
economicsByCounties.head()

In [None]:
merged_df = pd.DataFrame()
for i in range(n_clusters):
    test = pd.DataFrame(economicsByCounties[economicsByCounties['cluster'] == i].mean())
    test = test.T.set_index('cluster', drop = True)
    test['size'] = economicsByCounties[economicsByCounties['cluster'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])
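
The per-cluster means built in the loop above can also be obtained in one step with `groupby`. This is only a sketch: the helper `cluster_profile` is hypothetical, and the three toy rows stand in for `economicsByCounties`.

```python
import pandas as pd

def cluster_profile(df, cluster_col="cluster"):
    """Per-cluster mean of every feature, plus the number of rows in the cluster."""
    grouped = df.groupby(cluster_col)
    profile = grouped.mean()
    profile["size"] = grouped.size()
    return profile

# Hypothetical rows standing in for economicsByCounties.
toy = pd.DataFrame({
    "medianHouseHoldIncome": [40.0, 60.0, 90.0],
    "unEmployementRate": [8.0, 6.0, 3.0],
    "cluster": [0, 0, 1],
})
profile = cluster_profile(toy)
print(profile)
```

The result is indexed by cluster label, so a row can be looked up with `profile.loc[label]` in the same way the radar chart cell reads `merged_df`.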

In [None]:
def _scale_data(data, ranges):
    (x1, x2) = ranges[0]
    d = data[0]
    return [(d - y1) / (y2 - y1) * (x2 - x1) + x1 for d, (y1, y2) in zip(data, ranges)]

class RadarChart():
    def __init__(self, fig, location, sizes, variables, ranges, n_ordinate_levels = 6):

        angles = np.arange(0, 360, 360./len(variables))

        ix, iy = location[:] ; size_x, size_y = sizes[:]
        
        axes = [fig.add_axes([ix, iy, size_x, size_y], polar = True, 
        label = "axes{}".format(i)) for i in range(len(variables))]

        _, text = axes[0].set_thetagrids(angles, labels = variables)
        
        for txt, angle in zip(text, angles):
            if angle > -1 and angle < 181:
                txt.set_rotation(angle - 90)
            else:
                txt.set_rotation(angle - 270)
        
        for ax in axes[1:]:
            ax.patch.set_visible(False)
            ax.xaxis.set_visible(False)
            ax.grid(False)  # the string "off" is truthy and would switch the grid on
        
        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i],num = n_ordinate_levels)
            grid_label = [""]+["{:.0f}".format(x) for x in grid[1:-1]]
            ax.set_rgrids(grid, labels = grid_label, angle = angles[i])
            ax.set_ylim(*ranges[i])
        
        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges
        self.ax = axes[0]
                
    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def legend(self, *args, **kw):
        self.ax.legend(*args, **kw)
        
    def title(self, title, *args, **kw):
        self.ax.text(0.9, 1, title, transform = self.ax.transAxes, *args, **kw)

In [None]:
fig = plt.figure(figsize=(10,12))

attributes = ['medianHouseHoldIncome', 'unEmployementRate', 'vacantHousingUnitsRatio', 'familiesUnderPovertyScale', 'totalEmployedInOwnBusinesRate']
ranges = [[0.01, 100], [0.01, 100], [0.01, 100], [0.01, 100], [0.01, 100]]
index  = [0, 1, 2, 3, 4]

n_groups = n_clusters ; i_cols = 2
i_rows = n_groups//i_cols
size_x, size_y = (1/i_cols), (1/i_rows)

for ind in range(n_clusters):
    ix = ind%3 ; iy = i_rows - ind//3
    pos_x = ix*(size_x + 0.05) ; pos_y = iy*(size_y + 0.05)            
    location = [pos_x, pos_y]  ; sizes = [size_x, size_y] 
    #______________________________________________________
    values = np.array(merged_df.loc[index[ind], attributes])    
    radar = RadarChart(fig, location, sizes, attributes, ranges)
    radar.plot(values, color = 'b', linewidth=2.0)
    radar.fill(values, alpha = 0.2, color = 'b')
    radar.title(title = 'cluster nº{}'.format(index[ind]), color = 'r')
    ind += 1

Now we can look at what is in each cluster and rank the clusters by their economic score: income and self-employment count positively, while unemployment, vacant housing, and poverty count negatively.

In [None]:
clusterScore = {}

for i in range(n_clusters):
    county_cluster = economicsByCounties.loc[clusters == i]
    clusterScore[i] = (county_cluster['medianHouseHoldIncome'].mean() - 
                           county_cluster['unEmployementRate'].mean() - 
                           county_cluster['vacantHousingUnitsRatio'].mean() - 
                           county_cluster['familiesUnderPovertyScale'].mean() + 
                           county_cluster['totalEmployedInOwnBusinesRate'].mean()) / len(county_cluster)
    
sorted_d = dict( sorted(clusterScore.items(), key=operator.itemgetter(1),reverse=True))
print('Dictionary in descending order by value : ',sorted_d)

Adding the cluster labels to the main dataframe.

In [None]:
data.loc[:,'economicsCluster'] = clusters

In [None]:
data.to_csv(r'../../data/processed/tempDataFrameLandmark/dataWithEconomicCluster.csv') 

## Clustering of counties based on positive cultural impact on society

In this section we cluster the counties by indicators of positive cultural impact on society. We use the following fields:
* minorityPopulationRatio
* structureBuiltYearBefore1939Ratio
* CustomerSatisfactionAvgReviewRating
* CustomerSatisfactionAvgStarRating

In [None]:
cultureByCounties = data.filter(['minorityPopulationRatio', 'structureBuiltYearBefore1939Ratio', 'CustomerSatisfactionAvgReviewRating', 'CustomerSatisfactionAvgStarRating'], axis=1)
cultureByCounties.head()

<b>Data encoding</b>
The different variables I selected have quite different ranges of variation and before continuing the analysis, I create a matrix where these data are standardized

In [None]:
matrix = cultureByCounties.values  # as_matrix() was removed in pandas 1.0
scaler = StandardScaler()
scaler.fit(matrix)
print('variables mean values: \n' + 90*'-' + '\n' , scaler.mean_)
scaled_matrix = scaler.transform(matrix)

In [None]:
# Silhouette score for 3 to 9 clusters on the standardized cultural features
for n_clusters in tqdm(range(3,10)):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    if(n_clusters % 10 == 0):
        unique, counts = np.unique(clusters, return_counts=True)
        print("Minimum values in cluster is: " , min(counts))

The silhouette score rises up to 4 clusters and then starts to drop, so let's use 4 clusters.

In [None]:
n_clusters = 4
silhouette_avg = -1
while silhouette_avg < 0.261:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

<b>Characterizing the content of clusters</b>

Number of elements in each cluster

In [None]:
unique, counts = np.unique(clusters, return_counts=True)
print(np.asarray((unique, counts)).T)

<b>Silhouette intra-cluster score</b>

In order to have an insight on the quality of the classification, we can represent the silhouette scores of each element of the different clusters.

In [None]:
#____________________________________
# define individual silhouette scores
sample_silhouette_values = silhouette_samples(scaled_matrix, clusters)
#__________________
# and do the graph
graph_component_silhouette(n_clusters, [-0.15, 0.55], len(scaled_matrix), sample_silhouette_values, clusters)

<b>Counties Morphology</b>

In [None]:
cultureByCounties.loc[:, 'cluster'] = clusters
cultureByCounties.head()

In [None]:
merged_df = pd.DataFrame()
for i in range(n_clusters):
    test = pd.DataFrame(cultureByCounties[cultureByCounties['cluster'] == i].mean())
    test = test.T.set_index('cluster', drop = True)
    test['size'] = cultureByCounties[cultureByCounties['cluster'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])

In [None]:
fig = plt.figure(figsize=(10,12))

attributes = ['minorityPopulationRatio', 'structureBuiltYearBefore1939Ratio', 'CustomerSatisfactionAvgReviewRating', 'CustomerSatisfactionAvgStarRating']
ranges = [[0.01, 100], [0.01, 100], [0.01, 100], [0.01, 20]]  # one range per attribute
index  = [0, 1, 2, 3]

n_groups = n_clusters ; i_cols = 2
i_rows = n_groups//i_cols
size_x, size_y = (1/i_cols), (1/i_rows)

for ind in range(n_clusters):
    ix = ind%3 ; iy = i_rows - ind//3
    pos_x = ix*(size_x + 0.05) ; pos_y = iy*(size_y + 0.05)            
    location = [pos_x, pos_y]  ; sizes = [size_x, size_y] 
    #______________________________________________________
    values = np.array(merged_df.loc[index[ind], attributes])    
    radar = RadarChart(fig, location, sizes, attributes, ranges)
    radar.plot(values, color = 'b', linewidth=2.0)
    radar.fill(values, alpha = 0.2, color = 'b')
    radar.title(title = 'cluster nº{}'.format(index[ind]), color = 'r')
    ind += 1

Now we can look at what is in each cluster and rank the clusters by the combined mean of the four cultural indicators.

In [None]:
clusterScore = {}

for i in range(n_clusters):
    county_cluster = cultureByCounties.loc[clusters == i]
    clusterScore[i] = (county_cluster['minorityPopulationRatio'].mean() +
                           county_cluster['structureBuiltYearBefore1939Ratio'].mean() + 
                           county_cluster['CustomerSatisfactionAvgReviewRating'].mean() + 
                           county_cluster['CustomerSatisfactionAvgStarRating'].mean()) / len(county_cluster)
    
sorted_d = dict( sorted(clusterScore.items(), key=operator.itemgetter(1),reverse=True))
print('Dictionary in descending order by value : ',sorted_d)

Adding the cluster labels to the main dataframe.

In [None]:
data.loc[:,'cultureCluster'] = clusters

In [None]:
data.to_csv(r'../../data/processed/tempDataFrameLandmark/dataWithCultureCluster.csv') 

## Clustering of counties based on positive environmental impact on society

In this section we cluster the counties by indicators of positive environmental impact on society. We use the following fields:
* waterUsage
* bicycleUsageRate
* airQulaityPM2.5

In [None]:
environmentByCounties = data.filter(['waterUsage', 'bicycleUsageRate', 'airQulaityPM2.5'], axis=1)
environmentByCounties.head()

<b>Data encoding</b>
The different variables I selected have quite different ranges of variation and before continuing the analysis, I create a matrix where these data are standardized

In [None]:
from sklearn.preprocessing import StandardScaler

matrix = environmentByCounties.values  # as_matrix() was removed in pandas 1.0
scaler = StandardScaler()
scaler.fit(matrix)
print('variables mean values: \n' + 90*'-' + '\n' , scaler.mean_)
scaled_matrix = scaler.transform(matrix)

In [None]:
# Silhouette score for 3 to 9 clusters on the standardized environmental features
for n_clusters in tqdm(range(3,10)):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
    if(n_clusters % 10 == 0):
        unique, counts = np.unique(clusters, return_counts=True)
        print("Minimum values in cluster is: " , min(counts))

The silhouette score rises up to 6 clusters and then starts to drop, so let's use 6 clusters.

In [None]:
n_clusters = 6
silhouette_avg = -1
while silhouette_avg < 0.4786:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
    kmeans.fit(scaled_matrix)
    clusters = kmeans.predict(scaled_matrix)
    silhouette_avg = silhouette_score(scaled_matrix, clusters)
    
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

<b>Characterizing the content of clusters</b>

Number of elements in each cluster

In [None]:
unique, counts = np.unique(clusters, return_counts=True)
print(np.asarray((unique, counts)).T)

<b>Silhouette intra-cluster score</b>

In order to have an insight on the quality of the classification, we can represent the silhouette scores of each element of the different clusters.

In [None]:
#____________________________________
# define individual silhouette scores
sample_silhouette_values = silhouette_samples(scaled_matrix, clusters)
#__________________
# and do the graph
graph_component_silhouette(n_clusters, [-0.2, 0.8], len(scaled_matrix), sample_silhouette_values, clusters)

<b>Counties Morphology</b>

In [None]:
environmentByCounties.loc[:, 'cluster'] = clusters
environmentByCounties.head()

In [None]:
merged_df = pd.DataFrame()
for i in range(n_clusters):
    test = pd.DataFrame(environmentByCounties[environmentByCounties['cluster'] == i].mean())
    test = test.T.set_index('cluster', drop = True)
    test['size'] = environmentByCounties[environmentByCounties['cluster'] == i].shape[0]
    merged_df = pd.concat([merged_df, test])

In [None]:
fig = plt.figure(figsize=(10,12))

attributes = ['waterUsage', 'bicycleUsageRate', 'airQulaityPM2.5']
ranges = [[0.01, 100], [0.01, 100], [0.01, 100]]
index  = list(range(n_clusters))  # one entry per cluster; [0, 1, 2] raises IndexError with 6 clusters

n_groups = n_clusters ; i_cols = 2
i_rows = n_groups//i_cols
size_x, size_y = (1/i_cols), (1/i_rows)

for ind in range(n_clusters):
    ix = ind%3 ; iy = i_rows - ind//3
    pos_x = ix*(size_x + 0.05) ; pos_y = iy*(size_y + 0.05)            
    location = [pos_x, pos_y]  ; sizes = [size_x, size_y] 
    #______________________________________________________
    values = np.array(merged_df.loc[index[ind], attributes])    
    radar = RadarChart(fig, location, sizes, attributes, ranges)
    radar.plot(values, color = 'b', linewidth=2.0)
    radar.fill(values, alpha = 0.2, color = 'b')
    radar.title(title = 'cluster nº{}'.format(index[ind]), color = 'r')
    ind += 1

Now we can look at what is in each cluster and rank the clusters by their environmental score: bicycle usage counts positively, while water usage and PM2.5 count negatively.

In [None]:
clusterScore = {}

for i in range(n_clusters):
    county_cluster = environmentByCounties.loc[clusters == i]
    clusterScore[i] = (county_cluster['bicycleUsageRate'].mean() -
                       county_cluster['waterUsage'].mean() - 
                       county_cluster['airQulaityPM2.5'].mean()) / len(county_cluster)
    
sorted_d = dict( sorted(clusterScore.items(), key=operator.itemgetter(1),reverse=True))
print('Dictionary in descending order by value : ',sorted_d)

Adding the cluster labels to the main dataframe.

In [None]:
data.loc[:,'environmentCluster'] = clusters

In [None]:
data.to_csv(r'../../data/processed/tempDataFrameLandmark/dataWithEnvironmentCluster.csv') 