# Access Point Wifi
### Dataset by user @PavelBiz

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Dataset
### Preprocess and visualization

In [None]:
df = pd.read_csv('/kaggle/input/wifi-of-the-access-point-where-i-live/wifipoints.csv')
df.head()

In [None]:
df.info()

According to `df.info()`, no column contains null values.

`maximumspeeds` is not parsed as a numerical column because each value includes its unit.

In [None]:
# get the distinct speed units by dropping digits and whitespace from each value
df.maximumspeeds.apply( lambda speed: ''.join([c for c in speed if not c.isdigit() and not c.isspace()]) ).unique()

All speeds are expressed in _Mbps_, so the unit can be stripped from the values.

In [None]:
# rename the column so that its name carries the speed unit
df = df.rename(columns = {'maximumspeeds': 'maximumspeedsMbps'})

# drop every non-digit character (unit and whitespace), then parse as float
df['maximumspeedsMbps'] = df['maximumspeedsMbps'].str.replace(r'\D', '', regex = True).astype(float)

df.info()

The following cells show summary statistics and plots of the values in the dataset.

In [None]:
df.describe(include = 'all')

In [None]:
import plotly.express as px

# plot histos about data features
for col in ['authentications', 'ciphers', 'phytypes', 'channelsnumber', 'maximumspeedsMbps']:
    px.histogram(
        df,
        x = col,
        width = 500, height = 300,
        histnorm = 'percent'
    ).show()

The previous histograms show that `ciphers` and `phytypes` contain missing values, encoded as `_None` and `_` respectively (`_` stands for a whitespace character).
These values can be replaced with `NaN`.

In [None]:
# replace the placeholder strings for missing values with NaN
df['ciphers'] = df['ciphers'].replace(' None', np.nan)
df['phytypes'] = df['phytypes'].replace(' ', np.nan)

df.loc[:, ['ciphers', 'phytypes']].info()
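
The placeholders above carry a leading space, which suggests that other string values in the file may be padded as well. The following is a minimal sketch on a made-up frame of stripping whitespace before mapping placeholders to `NaN`. The `toy` data and the `clean_strings` helper are hypothetical: they are neither part of the dataset nor of pandas.

```python
import numpy as np
import pandas as pd

# toy data mimicking the padded placeholders; NOT the real dataset
toy = pd.DataFrame({
    'ciphers': [' CCMP', ' None', ' CCMP'],
    'phytypes': [' 802.11n', ' ', ' 802.11ac'],
})

def clean_strings(frame, placeholders=('None', '')):
    """Strip surrounding whitespace and turn placeholder strings into NaN."""
    out = frame.copy()
    for col in out.columns:
        stripped = out[col].str.strip()
        # after stripping, ' None' becomes 'None' and ' ' becomes ''
        out[col] = stripped.mask(stripped.isin(placeholders), np.nan)
    return out

cleaned = clean_strings(toy)
print(cleaned.isna().sum())  # number of missing values per column
```

Stripping first means a single placeholder list covers both the padded and unpadded spellings.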

## Quantitative features
### Maximum speed vs channel number

In [None]:
# correlation of quantitative features (non-numeric columns are skipped)
df.corr(numeric_only = True)

The correlation between channel number and maximum speed is $\rho = 0.65$; this moderate positive correlation suggests an association between the two features, although it does not by itself establish a causal link.
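
To make the meaning of $\rho$ concrete, here is a small sketch on made-up numbers that computes the Pearson coefficient from its definition and compares it with the pandas result. The `channel` and `speed` arrays are illustrative and not taken from the dataset. A rank-based (Spearman) coefficient is shown alongside as an option that may suit a discrete feature such as the channel number.

```python
import numpy as np
import pandas as pd

# illustrative values only; NOT taken from the dataset
channel = np.array([1, 1, 6, 6, 11, 36, 44], dtype=float)
speed = np.array([54, 144, 144, 300, 144, 866, 866], dtype=float)

# Pearson: covariance divided by the product of the standard deviations
dx = channel - channel.mean()
dy = speed - speed.mean()
rho_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

toy = pd.DataFrame({'channel': channel, 'speed': speed})
rho_pandas = toy['channel'].corr(toy['speed'])  # Pearson is the default method

# Spearman is the Pearson coefficient of the ranks (ties get average ranks)
rho_spearman = toy['channel'].rank().corr(toy['speed'].rank())

print(rho_manual, rho_pandas, rho_spearman)
```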

In [None]:
px.scatter(
    df, title = 'Maximum speed vs channel',
    x = 'channelsnumber', y = 'maximumspeedsMbps',
    hover_name = 'ssids', hover_data = df.columns.tolist(),
    width = 800, height = 500
)

The $x$ axis is discrete, so it is expected that points align along specific $x$ values.
What is unusual is that points on different channels share the same maximum speed, as if the $y$ axis were discrete too.
To group points with similar maximum speeds, KMeans clustering can be applied to `maximumspeedsMbps`.

In [None]:
from sklearn.cluster import KMeans

# fit KMeans with different numbers of clusters and record the inertia of each fit

n_clusters_range = range(10,21)
speeds = df.maximumspeedsMbps.values.reshape(-1,1)

inertia = pd.DataFrame({
    'n_clusters': list(n_clusters_range),
    'inertia': [
        KMeans(n_clusters = n, random_state = 1).fit(speeds).inertia_
        for n in n_clusters_range
    ]
})

px.line(
    inertia, title = 'KMeans inertia',
    x = 'n_clusters', y = 'inertia',
    width = 800, height = 400
)

KMeans could not find more than 16 distinct clusters, so the results for larger numbers of clusters should be disregarded.
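
One plausible reading of this limit, stated here as an assumption rather than a checked fact about the dataset, is that the speed column holds only 16 distinct values: KMeans cannot form more non-empty clusters than there are distinct points. The sketch below, on made-up speeds, shows how `nunique` and `value_counts` expose such a bound without any clustering.

```python
import pandas as pd

# made-up speeds; NOT the real dataset
speeds = pd.Series([54, 54, 144, 144, 144, 300, 866, 866], name='speed')

# number of distinct values: an upper bound on the number of non-empty clusters
n_distinct = speeds.nunique()

# how many points share each distinct value
counts = speeds.value_counts().sort_index()

print(n_distinct)
print(counts)
```

If the same check on `maximumspeedsMbps` returned 16, grouping by the speed value itself would give the same partition as the 16-cluster KMeans.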

In [None]:
# plot the results of KMeans with 16 clusters

model = KMeans(n_clusters = 16, random_state = 1)
df['clusters'] = model.fit_predict(df.maximumspeedsMbps.values.reshape(-1,1))

df = df.sort_values(by = 'clusters')
df.clusters = df.clusters.astype(str) # force plotly using discrete color sequence instead of continuous scale

px.scatter(
    df, title = 'Maximum speed vs channel',
    x = 'channelsnumber', y = 'maximumspeedsMbps',
    hover_name = 'ssids', hover_data = df.columns.tolist(),
    width = 800, height = 500,
    color = 'clusters'
).show()

df.clusters = df.clusters.astype(int)
df = df.sort_index()

Once the points with similar speeds are isolated, the qualitative features can be studied to find properties that characterize each cluster.

In [None]:
df.drop(columns = ['channelsnumber', 'maximumspeedsMbps']).groupby('clusters').describe()
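
As a complement to `describe`, a contingency table shows how one qualitative feature is distributed within each cluster. The sketch below uses `pd.crosstab` on a made-up frame. The cluster labels and `phytypes` values are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# made-up cluster labels and PHY types; NOT results from the dataset
toy = pd.DataFrame({
    'clusters': [0, 0, 1, 1, 1, 2],
    'phytypes': ['802.11n', '802.11n', '802.11ac', '802.11ac', '802.11n', '802.11ax'],
})

# share of each PHY type within every cluster (each row sums to 1)
table = pd.crosstab(toy['clusters'], toy['phytypes'], normalize='index')

print(table.round(2))
```

A row dominated by a single column would indicate that the cluster is characterized by that value of the feature.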