# Capstone Part 04
*Clustering airline crashes using Unsupervised Learning*

### Submitted by Roshan Lulu
___
<img src="./assets/images/2_cluster.png" alt="Drawing" style="width:600px" align="middle"/>
___

# Locations with high density of crashes
- My EDA gave me some very interesting insights into air crashes across the different features. Next, I'd like to see whether a clustering algorithm can add to them!
- Airline crashes always seem like individual events. I want to check whether any area tends to have a higher density of crashes than other locations.
- **Approach:**
    - Plot the accidents by their geographic coordinates and check for any obvious clusters.
    - Apply a clustering algorithm (most likely DBSCAN, since my aim is to find clusters based on the distances between crashes).
    - If a sensible cluster with a high density of crashes is found, label it and count its members.
    - Then look more closely at where the crashes are (Tableau might be better for visualization purposes).
- **Challenge:**
    - Plotting the crashes across the whole world might give too many data points to start clustering.
- **Solution:**
    - Start analysing by the cause of the crashes. This is a good way to check whether a certain type of crash is common at a location.
    

## 1. Read in Cleaned dataset

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Read CSV
data = pd.read_csv('./dataset/outputof1.csv')
# View columns
print(data.columns)
# data.head(2)

Index(['FlightNumber', 'Time', 'Day', 'Month', 'Year', 'Decade',
       'AirplaneDamage', 'AirplaneFate', 'Phase', 'Crash_Elev_m', 'Age',
       'Operator', 'Nature_Code', 'Type_Code', 'Engine_Type', 'Engine_count',
       'Crew_Fatal', 'Crew_Occ', 'Total_Fatal', 'Total_Occ', 'Psngr_Fatal',
       'Psngr_Occ', 'GndFatal', 'Coll_Fatal', 'Country', 'Continents',
       'Hemisphere', 'Seasons', 'orig_latitude', 'orig_longitude', 'label'],
      dtype='object')


## 2. In this section, I will check for accident clusters among NON-military flights in ASIA!
### I am interested in whether at least 20 crashes have occurred near any one site, each within 1 km of its neighbours.
*Cluster evaluation with DBSCAN clustering*

- Calculate the epsilon (`eps`) and minimum samples (`min_samples`) for the DBSCAN clustering.
- "Earth radius is the distance from Earth's center to its surface, about 6,371 kilometers (3,959 mi). This length is also used as a unit of distance, especially in astronomy and geology, where it is usually denoted by R⊕."

- Assume a maximum distance between neighbouring crashes of 1 km.
- Convert it to radians by dividing by the Earth radius: 1 / 6371 ≈ 0.000157 rad.

### Since clustering algorithms are based on distances, and I am interested in the actual distances, latitude and longitude will be my friends throughout this exercise
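
The kilometre-to-radian conversion described above can be sketched as follows. This is a minimal illustration that assumes the 6,371 km mean Earth radius quoted above; `km_to_radians` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
EARTH_RADIUS_KM = 6371.0  # mean Earth radius quoted above

def km_to_radians(distance_km, radius_km=EARTH_RADIUS_KM):
    """Convert a great-circle distance in km to the angle (in radians) it subtends."""
    return distance_km / radius_km

eps_1km = km_to_radians(1.0)
print(f"1 km  = {eps_1km:.6f} rad")
print(f"10 km = {km_to_radians(10.0):.6f} rad")
```

An `eps` expressed this way only means "1 km" if the clustering is run on real latitude/longitude values converted to radians.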

> 
### A. PREPARE THE DATA SUBSET AND STANDARDIZE THE Xs

In [3]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Get the data
data1 = data[data['Continents'] == 'Asia']
data1 = data1[data1['Nature_Code'] != 'Military']

X = data1[['orig_latitude', 'orig_longitude']]

# Lat vs Long plots
# plt.scatter(X['orig_longitude'],X['orig_latitude'])
# plt.show()

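# Scale the data (z-scores)
# NOTE: the standardized values are no longer degrees of latitude/longitude
Xs = StandardScaler().fit_transform(X)
Xs = pd.DataFrame(Xs, columns=X.columns)

# DBSCAN neighbourhood radius: 1 km expressed in radians
min_dist = 1                   # km
eps_radians = min_dist/6371    # 6371 km = mean Earth radius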

X.head(2)

Unnamed: 0,orig_latitude,orig_longitude
6,36.139705,120.383496
8,26.2,127.333333


> 
### B. PERFORM DBSCAN CLUSTERING

In [4]:
from sklearn.cluster import DBSCAN
import numpy as np

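dbscan = DBSCAN(eps=eps_radians, min_samples=20, algorithm='ball_tree', metric='haversine')
# NOTE: the haversine metric expects [latitude, longitude] in radians. Xs holds
# standardized values, so eps_radians does not correspond to exactly 1 km here.
# Passing np.radians(X) instead would use the true coordinates, but may change the clusters found.
dbscan.fit(np.radians(Xs))

labels = dbscan.labels_
Xs['dbscan_label'] = labels
X = X.assign(dbscan_label=labels)  # assign() returns a new frame and avoids SettingWithCopyWarning
# Calculate the silhouette score to check how well the clusters are separated from each other.
# Adjust the epsilon and min samples accordingly.
# NOTE: X already contains the dbscan_label column here, so the label itself is scored as a feature.
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# Boolean mask marking the core samples
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True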
# print(np.unique(labels))

# Plot the clusters
unique_labels = np.unique(labels)
print(unique_labels)
# colors = plt.cm.Spectral(np.linspace(0,1, len(unique_labels)))

# fig, axarr = plt.subplots(1,1, figsize=(20,10))
# for label, color in zip(Xs.dbscan_label.unique(), colors):
#     if label != -1:
#         X_ = Xs[Xs.dbscan_label == label]
#         axarr.scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
#                              color=color, label=label, alpha=0.9)
        
# axarr.set_title("Clusters", fontsize=20)
# axarr.legend(loc='lower right')
# plt.show()

Silhouette Coefficient: -0.265
[-1  0]
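
For comparison, here is a minimal sketch of DBSCAN with the haversine metric applied to unscaled coordinates. scikit-learn's haversine metric expects `[latitude, longitude]` pairs in radians, whereas the cell above converts the standardized `Xs`. The points below are synthetic and purely illustrative, and `demo_labels` is a hypothetical name; this is not a re-run of the analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Synthetic (latitude, longitude) pairs in degrees, purely illustrative.
rng = np.random.default_rng(0)
site = np.array([10.0, 20.0])
# 25 points jittered by at most 0.001 degrees (roughly 100 m) around one site
near = site + rng.uniform(-0.001, 0.001, size=(25, 2))
# 5 isolated points, each far from the site and from one another
far = np.array([[0.0, 0.0], [15.0, 30.0], [-5.0, 40.0], [25.0, 10.0], [40.0, 60.0]])
coords_deg = np.vstack([near, far])

eps_rad = 1.0 / EARTH_RADIUS_KM  # 1 km expressed in radians

db = DBSCAN(eps=eps_rad, min_samples=20, algorithm="ball_tree", metric="haversine")
# Columns must be ordered [lat, lon] and converted to radians, without scaling
demo_labels = db.fit_predict(np.radians(coords_deg))

print(np.unique(demo_labels, return_counts=True))
```

With real coordinates, `eps_rad` keeps its physical meaning, so a cluster found this way really does consist of points chained together by hops of at most 1 km.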




In [5]:
# Count the points per cluster label (-1 = noise)
print(X.dbscan_label.value_counts())
X.head(2)

-1    1044
 0      20
Name: dbscan_label, dtype: int64


Unnamed: 0,orig_latitude,orig_longitude,dbscan_label
6,36.139705,120.383496,-1
8,26.2,127.333333,-1


> 
### C. PLOT THE CLUSTER WITH MAX POINTS ON THE MAP

In [6]:
# Inference: cluster 0 is the only cluster found. Let me plot it on the world map to see how it looks
import folium

#create a map
this_map = folium.Map(prefer_canvas=True)

def plotDot(point):
    '''input: Series with numeric orig_latitude and orig_longitude, plus a dbscan_label
    this function creates a CircleMarker for points in cluster 0 and adds it to this_map'''
    if(point.dbscan_label == 0):
#         print(point)
        folium.CircleMarker(location=[point.orig_latitude,point.orig_longitude],
#                             popup=point['dbscan_label'],
                            fill_color='#132b5e',
                            radius=5,
                            weight=0).add_to(this_map)

#use df.apply(,axis=1) to "iterate" through every row in dataframe
X.apply(plotDot, axis = 1)
# Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())
#Save the map to an HTML file
this_map.save('cluster_in_asia.html')
this_map

> 
### D. INFERENCE FROM THE FINDINGS
**From this I see that there have been 20 accidents near Indira Gandhi International Airport, Delhi. Most were recorded at the same place, hence they share the same latitude and longitude.**

In [7]:
data1['dbscan_label'] = labels
data1_Asia_Cluster = data1[data1['dbscan_label'] == 0]
pd.DataFrame(data1_Asia_Cluster.Decade.value_counts()).reset_index()
# There were 6 crashes in the 1950s alone!

Unnamed: 0,index,Decade
0,1950,6
1,1990,4
2,1980,3
3,2000,3
4,1960,2
5,1970,1
6,1930,1


In [8]:
data1_Asia_Cluster[data1_Asia_Cluster.Decade != 1930]
# Mostly Douglas aircraft with Pratt & Whitney engines!

Unnamed: 0,FlightNumber,Time,Day,Month,Year,Decade,AirplaneDamage,AirplaneFate,Phase,Crash_Elev_m,Age,Operator,Nature_Code,Type_Code,Engine_Type,Engine_count,Crew_Fatal,Crew_Occ,Total_Fatal,Total_Occ,Psngr_Fatal,Psngr_Occ,GndFatal,Coll_Fatal,Country,Continents,Hemisphere,Seasons,orig_latitude,orig_longitude,label,dbscan_label
1090,1150,03:15,Friday,MAR,2002,2000,Damaged beyond repair,-,Taxi (TXI),0,22,Indian Airlines,Unknown,Airbus,General Electric,2.0,0,5,0,5,0,0,0,0,India,Asia,Northern,Spring,28.556162,77.099958,others,0
1512,1572,08:00,Thursday,MAR,2006,2000,Damaged beyond repair,-,Landing (LDG),0,39,Valan International Cargo Charter,Unknown,Antonov,Ivchenko,2.0,0,0,0,0,0,0,0,0,Iraq,Asia,Northern,Spring,28.546116,77.303804,others,0
2003,2063,16:00,Sunday,FEB,2009,2000,,-,En route (ENR),0,0,IndiGo Airlines,Passenger - Domestic,Airbus,IAE,2.0,0,6,0,169,0,163,0,0,India,Asia,Northern,Winter,28.556162,77.099958,malfunction,0
2732,2792,20:20,Friday,JAN,1983,1980,Substantial,Repaired,Standing (STD),0,4,Indian Airlines,Unknown,Airbus,General Electric,2.0,0,0,0,0,0,0,0,0,India,Asia,Northern,Winter,28.556162,77.099958,meteorological,0
2733,2793,20:20,Friday,JAN,1983,1980,Substantial,Repaired,Taxi (TXI),0,5,Air-India,Passenger - Intl,Boeing,Pratt & Whitney,4.0,0,11,0,284,0,273,0,0,India,Asia,Northern,Winter,28.556162,77.099958,meteorological,0
3066,3126,17:26,Sunday,JUN,1988,1980,Substantial,Written off (damaged beyond repair),Landing (LDG),0,18,Indian Airlines,Passenger - Domestic,Boeing,Pratt & Whitney,2.0,0,6,0,134,0,128,0,0,India,Asia,Northern,Summer,28.556162,77.099958,others,0
3292,3352,09:17,Monday,MAY,1990,1990,Damaged beyond repair,-,Landing (LDG),0,18,Air-India,Passenger - Intl,Boeing,Pratt & Whitney,4.0,0,20,0,215,0,195,0,0,India,Asia,Northern,Spring,28.556162,77.099958,others,0
3670,3730,14:54,Tuesday,MAR,1994,1990,Destroyed,Written off (damaged beyond repair),Standing (STD),0,5,Aeroflot,Unknown,Ilyushin,Kuznetsov,4.0,4,4,4,4,0,0,5,0,India,Asia,Northern,Spring,28.556162,77.099958,airtrafficmgmt,0
3672,3732,14:54,Tuesday,MAR,1994,1990,Destroyed,Written off (damaged beyond repair),Takeoff (TOF),237,15,Sahara Airlines,Train/Test/Demo,Boeing,Pratt & Whitney,2.0,4,4,4,4,0,0,5,0,India,Asia,Northern,Spring,28.556162,77.099958,airtrafficmgmt,0
3779,3839,04:08,Saturday,JAN,1993,1990,Damaged beyond repair,-,Landing (LDG),0,11,Indian Airlines,Passenger - Domestic,Tupolev,Kuznetsov,3.0,0,13,0,165,0,152,0,0,India,Asia,Northern,Winter,28.556162,77.099958,human,0
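
The 1 km radius can be sanity-checked against the table above, which contains two distinct coordinate pairs. The sketch below computes the great-circle distance between them with the haversine formula, assuming a 6,371 km Earth radius; `haversine_km` is a hypothetical helper written for this check.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# The two distinct coordinate pairs that appear in the cluster table above
d = haversine_km(28.556162, 77.099958, 28.546116, 77.303804)
print(f"{d:.1f} km apart")
```

The two locations come out roughly 20 km apart, which suggests that the effective neighbourhood radius in this run was larger than 1 km. That is consistent with the coordinates having been standardized before the radian conversion.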


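## Inference

- Accidents occurred mostly between the 1960s and the 2000s (13 of 20), although the 1950s was the single worst decade with 6.
- Two collisions appear to have occurred (the paired records from January 1983 and March 1994).
- Features worth examining further within the cluster:
    - Aircraft type and engine type
    - Airline
    - Phase
    - Fatalities

Now: things have improved since 2000. Few accidents have been recorded in that area since then (3 in the 2000s, none of them fatal), so an improvement is observed!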

## 3. In this section, I will check for accident clusters among NON-military flights in AFRICA!
### I am interested in whether at least 15 crashes have occurred near any one site, each within 1 km of its neighbours.
*Cluster evaluation with DBSCAN clustering*

- The epsilon is derived exactly as in section 2: a neighbour distance of 1 km divided by the Earth radius of about 6,371 km, which gives roughly 0.000157 rad.
- The minimum number of samples is lowered to 15 for this continent.

### As before, latitude and longitude are the only inputs to the clustering

> 
### A. PREPARE THE DATA SUBSET AND STANDARDIZE THE Xs

In [9]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Get the data
data1 = data[data['Continents'] == 'Africa']
data1 = data1[data1['Nature_Code'] != 'Military']

X = data1[['orig_latitude', 'orig_longitude']]

# Lat vs Long plots
# plt.scatter(X['orig_longitude'],X['orig_latitude'])
# plt.show()

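# Scale the data (z-scores)
# NOTE: the standardized values are no longer degrees of latitude/longitude
Xs = StandardScaler().fit_transform(X)
Xs = pd.DataFrame(Xs, columns=X.columns)

# Scaled lat vs long plot
# plt.scatter(Xs['orig_longitude'],Xs['orig_latitude'])
# plt.show()

# DBSCAN neighbourhood radius: 1 km expressed in radians
min_dist = 1                   # km
eps_radians = min_dist/6371    # 6371 km = mean Earth radius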

X.head(2)

Unnamed: 0,orig_latitude,orig_longitude
18,11.561443,43.144739
36,32.116236,20.07279


> 
### B. PERFORM DBSCAN CLUSTERING

In [10]:
from sklearn.cluster import DBSCAN
import numpy as np

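dbscan = DBSCAN(eps=eps_radians, min_samples=15, algorithm='ball_tree', metric='haversine')
# NOTE: the haversine metric expects [latitude, longitude] in radians. Xs holds
# standardized values, so eps_radians does not correspond to exactly 1 km here.
# Passing np.radians(X) instead would use the true coordinates, but may change the clusters found.
dbscan.fit(np.radians(Xs))

labels = dbscan.labels_
Xs['dbscan_label'] = labels
X = X.assign(dbscan_label=labels)  # assign() returns a new frame and avoids SettingWithCopyWarning
# Calculate the silhouette score to check how well the clusters are separated from each other.
# Adjust the epsilon and min samples accordingly.
# NOTE: X already contains the dbscan_label column here, so the label itself is scored as a feature.
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# Boolean mask marking the core samples
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True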
# print(np.unique(labels))

# Plot the clusters
unique_labels = np.unique(labels)
print(unique_labels)
# colors = plt.cm.Spectral(np.linspace(0,1, len(unique_labels)))

# fig, axarr = plt.subplots(1,1, figsize=(20,10))
# for label, color in zip(Xs.dbscan_label.unique(), colors):
#     if label != -1:
#         X_ = Xs[Xs.dbscan_label == label]
#         axarr.scatter(X_.iloc[:,0], X_.iloc[:,1], s=70, 
#                              color=color, label=label, alpha=0.9)
        
# axarr.set_title("Clusters", fontsize=20)
# axarr.legend(loc='lower right')
# plt.show()

Silhouette Coefficient: -0.336
[-1  0  1]




In [11]:
# Count the points per cluster label (-1 = noise)
print(X.dbscan_label.value_counts())
X.head(2)

-1    557
 0     29
 1     18
Name: dbscan_label, dtype: int64


Unnamed: 0,orig_latitude,orig_longitude,dbscan_label
18,11.561443,43.144739,-1
36,32.116236,20.07279,-1


> 
### C. PLOT THE CLUSTER WITH MAX POINTS ON THE MAP

In [12]:
# Inference: two clusters (0 and 1) were found. Let me plot every non-noise point on the world map to see how they look
import folium

#create a map
this_map = folium.Map(prefer_canvas=True)

def plotDot(point):
    '''input: Series with numeric orig_latitude and orig_longitude, plus a dbscan_label
    this function creates a CircleMarker for every non-noise point and adds it to this_map'''
    if(point.dbscan_label != -1):
#         print(point)
        folium.CircleMarker(location=[point.orig_latitude,point.orig_longitude],
                            popup=str(point['dbscan_label']),
                            fill_color='#000000',
                            radius=5,
                            weight=5).add_to(this_map)

#use df.apply(,axis=1) to "iterate" through every row in dataframe
X.apply(plotDot, axis = 1)
# Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())
#Save the map to an HTML file
this_map.save('cluster_in_africa.html')
this_map

> 
### D. INFERENCE FROM THE FINDINGS
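**From this I see that cluster 0 contains 29 accidents around Kinshasa (Democratic Republic of the Congo) and Brazzaville (Congo), which face each other across the Congo River. Many records share the same coordinates because they were logged at the same place. A second, smaller cluster (label 1) contains 18 accidents.**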

In [13]:
data1['dbscan_label'] = labels
data1_Africa_Cluster = data1[data1['dbscan_label'] == 0]
pd.DataFrame(data1_Africa_Cluster.Decade.value_counts()).reset_index()
# The 1990s and 2000s account for 21 of the 29 crashes in this cluster!

Unnamed: 0,index,Decade
0,2000,11
1,1990,10
2,2010,5
3,1970,2
4,1980,1


In [15]:
data1_Africa_Cluster[data1_Africa_Cluster.Decade == 2000]
# Aircraft types are mixed; Pratt & Whitney is the most common engine manufacturer in the rows shown

Unnamed: 0,FlightNumber,Time,Day,Month,Year,Decade,AirplaneDamage,AirplaneFate,Phase,Crash_Elev_m,Age,Operator,Nature_Code,Type_Code,Engine_Type,Engine_count,Crew_Fatal,Crew_Occ,Total_Fatal,Total_Occ,Psngr_Fatal,Psngr_Occ,GndFatal,Coll_Fatal,Country,Continents,Hemisphere,Seasons,orig_latitude,orig_longitude,label,dbscan_label
797,857,11:30,Friday,APR,2000,2000,Damaged beyond repair,-,Standing (STD),0,11,Blue Lines,Unknown,Antonov,Glushenkov,2.0,0,0,0,0,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,maintanence,0
798,858,11:30,Friday,APR,2000,2000,Damaged beyond repair,-,Standing (STD),0,36,Air Force,Unknown,Others,Rolls-Royce,4.0,0,0,0,0,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,maintanence,0
799,859,11:30,Friday,APR,2000,2000,Damaged beyond repair,-,Standing (STD),0,39,Air Force,Unknown,Others,Rolls-Royce,4.0,0,0,0,0,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,maintanence,0
922,982,20:00,Sunday,APR,2001,2000,Damaged beyond repair,-,Unknown (UNK),0,4,Union Charter Trust,Unknown,Cessna,Pratt & Whitney,1.0,0,0,0,9,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.408908,15.397444,others,0
965,1025,22:00,Friday,APR,2002,2000,Damaged beyond repair,-,Landing (LDG),0,29,Hewa Bora Airways,Cargo,Boeing,Pratt & Whitney,4.0,0,3,0,3,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,airtrafficmgmt,0
987,1047,18:30,Tuesday,APR,2003,2000,Substantial,-,Landing (LDG),0,14,Avirex,Passenger - Intl,Beechcraft,Pratt & Whitney,2.0,0,2,0,15,0,13,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,others,0
995,1055,10:33,Friday,APR,2003,2000,Substantial,Repaired,Landing (LDG),0,32,Wetrafa,Passenger - Domestic,Douglas,Pratt & Whitney,2.0,0,8,0,45,0,37,0,0,Congo,Africa,Southern,Autumn,-4.258899,15.251139,meteorological,0
1312,1372,10:00,Tuesday,JUN,2007,2000,Damaged beyond repair,-,En route (ENR),0,19,Business Aviation,Unknown,Others,Walter,2.0,0,0,0,0,0,0,0,0,Congo,Africa,Southern,Winter,-4.26336,15.242885,others,0
1313,1373,24:00,Tuesday,JUN,2007,2000,Damaged beyond repair,-,En route (ENR),0,19,Business Aviation,Unknown,Others,Walter,2.0,0,0,0,0,0,0,0,0,Congo,Africa,Southern,Winter,-4.26336,15.242885,others,0
1509,1569,06:00,Friday,MAR,2006,2000,Substantial,Written off (damaged beyond repair),Standing (STD),0,53,LAC - SkyCongo,Unknown,Others,Unknown,-1.0,0,0,0,0,0,0,0,0,"Congo, The Democratic Republic of the",Africa,Southern,Autumn,-4.386449,15.445372,others,0


## Inference

- Accidents in this cluster are concentrated in the 1990s and 2000s (21 of 29), with 5 more in the 2010s.
- Features worth examining further within the cluster:
    - Aircraft type and engine type
    - Airline
    - Phase
    - Fatalities

Now: unlike the Delhi cluster, there is no clear sign of improvement here, since the most recent decades account for most of the accidents.

# References
 
1. https://stats.stackexchange.com/questions/121916/why-are-mixed-data-a-problem-for-euclidean-based-clustering-algorithms

2. http://scikit-learn.org/stable/modules/clustering.html

3. https://stats.stackexchange.com/questions/187595/clustering-with-categorical-and-numeric-data

4. https://stackoverflow.com/questions/34579213/dbscan-for-clustering-of-geographic-location-data

5. http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

# Try tomorrow
-- http://qingkaikong.blogspot.com.au/2016/08/clustering-with-dbscan.html

# Valid results above this

## K - Means Clustering
**Source:** https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)

Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm. Feature engineering plays a key role in K-means clustering; using meaningful features that capture the variability of the data is essential for the algorithm to find all of the naturally-occurring groups.

### Some light on Continuous and Categorical variables as predictors when performing Clustering!
**Source:** https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data/24#24
- If you have multiple categories you could code them as dummy variables, or better yet, transform them into a single classifier using principal components.  
- There is also a specialized form of hierarchical clustering known as a two step or two stage cluster model which is used by some products. This model handles both continuous and categorical variables.

**What went wrong?**
- Before performing K-means, I made a pairplot of the variables to check their relationships. I realized that only the continuous variables appeared in the pairplot.
- So when I tried to add all predictors and run K-means clustering, it raised an error because the categorical variables were not dummy coded.

**How to fix it?**
- First I will dummy code the categorical variables, then pass the result through K-means clustering.
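
A minimal sketch of the dummy-coding step is shown below. The column names follow the dataset above, but the rows are toy values invented for illustration, and `toy` and `encoded` are hypothetical names.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset above.
toy = pd.DataFrame({
    "Age": [22, 39, 4],
    "Engine_count": [2.0, 2.0, 4.0],
    "Phase": ["Taxi (TXI)", "Landing (LDG)", "Standing (STD)"],
    "Seasons": ["Spring", "Spring", "Winter"],
})

# One 0/1 indicator column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Phase", "Seasons"], dtype=int)
print(encoded.columns.tolist())
```

Every column of `encoded` is numeric, so it can be passed to K-means. The caveats about binary variables quoted in the notes at the end of this notebook still apply.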

## What would be a good cluster number?
- Ref: 8.2.1

**Inertia vs. K clusters: the elbow method**
- Plot the inertia vs. the K number of clusters to get an idea of what the optimal number of clusters would be for the dataset. The "elbow" technique, though controversial, is a great heuristic to evaluate the optimal K. 

- Basically, we look for the K where the inertia has an "elbow": the point where decreases in inertia are considerably more marginal than for previous increases in K.
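
The elbow method can be sketched as follows. The data is synthetic (four blobs from `make_blobs`), chosen only to make the elbow easy to see; it is not the crash dataset, and `X_demo` and `inertias` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (illustrative only).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

k_values = range(1, 10)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` (for example with `plt.plot`) gives the elbow curve. The K after which the curve flattens is the candidate number of clusters.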


### Fit K means clustering model
- algorithm = 'auto'; chooses the Elkan algorithm for dense data and the full algorithm for sparse data.
- copy_x = True; the original data is not modified.
- init = 'k-means++'; selects initial cluster centers for K-means clustering in a smart way to speed up convergence. See the Notes section of k_init for more details.
- max_iter = 300; maximum number of iterations in a single run to calculate the groups.
- n_clusters = 6; number of clusters (and centroids) to form.
- n_init = 10; number of runs with different centroid seeds, of which the best by inertia is kept.
- n_jobs = 1; number of parallel jobs.
- precompute_distances = 'auto'; does not precompute distances when a lot of memory would be required.
- random_state = None; no seed is fixed, so the randomized k-means++ initialization can give different results from run to run.
- tol = 0.0001; convergence tolerance.
- verbose = 0; no progress output.
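
A minimal sketch of fitting K-means with these settings and reading back the labels and centroids follows. The feature matrix is random placeholder data, not the crash dataset. `n_jobs`, `precompute_distances` and `algorithm='auto'` are left out on the assumption that a recent scikit-learn release is installed, since newer versions no longer accept them. `random_state` is fixed here only to make the sketch repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # placeholder numeric feature matrix

kmeans = KMeans(
    n_clusters=6,
    init="k-means++",
    n_init=10,
    max_iter=300,
    tol=1e-4,
    copy_x=True,
    verbose=0,
    random_state=0,  # fixed only so that this sketch is repeatable
)
kmeans_labels = kmeans.fit_predict(X_demo)

print(kmeans.cluster_centers_.shape)  # one centroid per cluster, one column per feature
print(np.bincount(kmeans_labels))     # number of points assigned to each cluster
```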

### Compute labels and centroids

## Plot the clusters

## Results of clustering


**Method:**
- The clustering models were fit on the dataframe without the label and the latitude/longitude columns. I chose to do this because latitude and longitude would produce obvious clusters.
- When plotting, I plotted the newly found clusters against latitude and longitude to see the effect.

> **Source** - https://stats.stackexchange.com/questions/58910/kmeans-whether-to-standardise-can-you-use-categorical-variables-is-cluster-3
- Discrete data is a larger issue. K-means is meant for continuous data. The mean will not be discrete, so the cluster centers will likely be anomalous. You have a high chance that the clustering algorithms ends up discovering the discreteness of your data, instead of a sensible structure.

- Categorical variables are worse. K-means can't handle them at all; a popular hack is to turn them into multiple binary variables (male, female). This will however expose above problems just at an even worse scale, because now it's multiple highly correlated binary variables.

- Since you apparently are dealing with survey data, consider using hierarchical clustering. With an appropriate distance function, it can deal with all above issues. You just need to spend some effort on finding a good measure of similarity.

- mca is a Multiple Correspondence Analysis (MCA) package for python, intended to be used with pandas. MCA is a feature extraction method; essentially PCA for categorical variables. You can use it, for example, to address multicollinearity or the curse of dimensionality with big categorical variables.

