In [11]:
import numpy as np
import pickle as pkl
import pandas as pd
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


#### Part A\
\
Load the wspr.csv data set into a data frame. Are there any missing values?
Perform any necessary data imputation on numerical data in the data frame.

In [12]:
# 1.a
df = pd.read_table("/public/bmort/python/wspr.csv", sep = ",")

# data imputation step: replacing all "NaN" with 0

df.replace("NaN", pd.NA, inplace=True)
df.fillna(0, inplace=True)


I notice that some numerical values in the data frame are 0 and some categorical values are marked as "NaN", so the data set does contain missing values. I perform data imputation by replacing all "NaN" values with 0.
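To make the missing-value check explicit, the sketch below counts missing entries per column and then imputes only the numerical columns. It runs on a small made-up table (`toy`), not on wspr.csv, and it fills with the column median rather than 0. The median fill and the `"unknown"` label are assumptions chosen for illustration, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WSPR table. The column names mirror the data set,
# but the values are made up for illustration.
toy = pd.DataFrame({
    "distance": [120.0, np.nan, 3400.0, 560.0],
    "power": [23.0, 30.0, np.nan, 37.0],
    "rx_sign": ["KFS", None, "W1AW", "KFS"],
})

# Count missing values per column before changing anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Impute only the numerical columns, here with each column's median
# (an assumption; the notebook itself fills with 0).
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Give categorical gaps an explicit label instead of a number.
toy["rx_sign"] = toy["rx_sign"].fillna("unknown")
print(toy)
```

Filling with 0 treats a missing distance or power as a real measurement of zero, which can pull the summary statistics and the regression toward zero. A median fill keeps the imputed values inside the observed range.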


#### Part B\
\

Produce a table of summary statistics on the data set. How do the ranges of the
values in the columns with numerical data compare? Does each column of numerical
data have similar magnitudes and ranges? Are there any outliers?


In [13]:
# 1.b

print(df.describe())

                 id           band         rx_lat         rx_lon  \
count  3.000000e+05  300000.000000  300000.000000  300000.000000   
mean   6.554775e+09      11.575247      41.067491     -60.402133   
std    9.996265e+04      57.971250      13.644301      59.148742   
min    6.554620e+09      -1.000000     -70.646000    -157.958000   
25%    6.554701e+09       7.000000      38.104000    -105.958000   
50%    6.554774e+09      10.000000      41.771000     -79.875000   
75%    6.554847e+09      14.000000      47.688000       0.292000   
max    6.557023e+09    2400.000000      68.354000     175.875000   

              tx_lat         tx_lon       distance        azimuth  \
count  300000.000000  300000.000000  300000.000000  300000.000000   
mean       39.733674     -62.201610    2275.101273     177.455867   
std        13.304164      56.596384    2172.390931     110.816167   
min       -87.521000    -173.042000       0.000000       0.000000   
25%        34.729000     -99.042000     88


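The numerical columns do not have similar magnitudes or ranges. Some means are positive and others are negative, and the means differ by orders of magnitude: for example, distance averages about 2275 while band averages about 11.6. The band column appears to contain outliers, since its maximum is 2400 while its 75th percentile is 14, and its minimum is -1. Among the measured quantities, the frequency column stands out with by far the largest numerical values in the summary statistics.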
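One common way to back up an outlier claim is the 1.5 × IQR rule. The sketch below applies it to a hypothetical `band` series whose quartiles resemble those in the summary table. The values are invented for illustration and are not rows from the data set.

```python
import pandas as pd

# Hypothetical band values chosen to resemble the summary above
# (quartiles near 7 and 14, plus extreme entries); not real rows.
band = pd.Series([-1, 3, 7, 7, 10, 10, 14, 14, 18, 2400])

q1 = band.quantile(0.25)
q3 = band.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = band[(band < lower) | (band > upper)]
print(outliers.tolist())
```

On the real data frame the same steps would apply to `df["band"]` or to any other numerical column.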



#### Part C\
\
How many unique values are in each of the following columns: band, rx_sign, and
tx_sign?

In [14]:
# 1.c

# nunique() counts the distinct values in a column directly
for col in ["band", "rx_sign", "tx_sign"]:
    print(df[col].nunique(dropna=False))

23
891
1070



The column band has 23 unique values, the column rx_sign has 891 unique values, and the column tx_sign has 1070 unique values.



#### Part D\
\
What is the average distance (in km) between the transmitting station and the
receiving station for signals that have a power less than 30 dBm?


In [15]:
#1.d
avg_30 = []

for i in range(len(df.distance)):
    if df.power[i] < 30:
        avg_30.append(df.distance[i])
print(sum(avg_30) / len(avg_30))

2057.334727356264



The average distance between the transmitting station and the receiving station for signals that have a power less than 30 dBm is approximately 2057.33 km.
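The same filter-then-average step can be written without an explicit loop by using a boolean mask. The sketch below uses a small made-up table (`toy`) so that the result is easy to check by hand. It is an alternative formulation, not the code that produced the number above.

```python
import pandas as pd

# Made-up rows; only the column names match the WSPR data set.
toy = pd.DataFrame({
    "power": [20, 23, 30, 37, 27],
    "distance": [100.0, 300.0, 5000.0, 8000.0, 500.0],
})

# Boolean mask: True for rows with power strictly below 30 dBm.
low_power = toy["power"] < 30

# Average the distance over the selected rows only.
avg_distance = toy.loc[low_power, "distance"].mean()
print(avg_distance)
```

Here three rows pass the filter (distances 100, 300, and 500), so the mean is 300. The row with power exactly 30 is excluded because the comparison is strict.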



#### Part E \
\
What is the call sign of the receiving station that received the most signal
transmissions on the 14 MHz band (i.e. band = 14) during the ten-minute period of
1:00 – 1:10 UTC? Hint: Use Python’s datetime library.


In [16]:
# 1.e
df['time'] = pd.to_datetime(df['time'])

# Keep band-14 rows whose time of day falls within 1:00 - 1:10 UTC (inclusive)
in_window = (df['time'].dt.time >= dt.time(1, 0)) & (df['time'].dt.time <= dt.time(1, 10))
filtered_df = df[(df['band'] == 14) & in_window]

max_call_sign = filtered_df['rx_sign'].value_counts().idxmax()

print(max_call_sign)

KFS



KFS is the call sign of the receiving station that received the most signal transmissions on the 14 MHz band during the ten-minute period of 1:00 - 1:10 UTC.



#### Part F \
\
Partition the WSPR data set so that a random sample of 80% of the data will be used for training and 20% will be used for testing your machine learning model.

In [17]:
# 1.f
train, test = train_test_split(df, test_size=0.2, random_state = 42)


#### Part G\
\
Using Python’s scikit-learn library, generate a linear regression model to predict the
signal-to-noise ratio from the distance, frequency, and power.

In [18]:
# 1.g

X_train = train[['distance', 'frequency', 'power']]
y_train = train['snr']

X_test = test[['distance', 'frequency', 'power']]
y_test = test['snr']

model = LinearRegression()
model.fit(X_train, y_train)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

Intercept: -20.244745492655323
Coefficients: [-1.07415430e-03  1.25381904e-09  2.98849527e-01]



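The linear regression model for the signal-to-noise ratio based on distance, frequency, and power in the WSPR data set is: $\hat{y} = -20.24474549265532 - 1.07415430 \times 10^{-3} X_1 + 1.25381904 \times 10^{-9} X_2 + 2.98849527 \times 10^{-1} X_3$, where $X_1$ is the distance (km), $X_2$ is the frequency (Hz), and $X_3$ is the power (dBm).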



#### Part H \ 
\
\
Use the 20% of the data set aside for testing to determine the accuracy of your
model. Choose an appropriate accuracy metric. How well does your model predict
the signal to noise values from distance, frequency, and power? Comment on why
the accuracy is good or poor.

In [22]:
# 1.h

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 75.18509424244381
R-squared: 0.1022570484041998



The accuracy metrics I used for the test data are mean squared error and the $r^2$ score. A lower mean squared error means the model's predictions are closer to the observed values, and an $r^2$ score closer to one indicates the model is a good fit for the data. On the test data the mean squared error is 75.19 and the $r^2$ score is 0.10, so the model explains only about 10% of the variance in the signal-to-noise ratio. I believe from these two values that the accuracy of this linear regression is poor. A linear function of distance, frequency, and power does not fit the data well, likely because the signal-to-noise ratio depends on factors that are not in the model or is not linear in these features. Outliers in the data may also be affecting the fit.
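Two small calculations can make these metrics easier to interpret. The sketch below takes the square root of the reported mean squared error to get an error in the same units as the signal-to-noise ratio (dB for WSPR reports), and it spells out the definition of $r^2$ in plain Python. The helper `r_squared` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

mse = 75.18509424244381  # test-set MSE reported above
rmse = math.sqrt(mse)    # root mean squared error, in SNR units
print(f"RMSE: {rmse:.2f}")


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up SNR values: predicting the mean everywhere gives r^2 = 0,
# which is the baseline that an r^2 of 0.10 only slightly improves on.
y_true = [-20.0, -15.0, -10.0, -5.0]
baseline = [-12.5] * len(y_true)
print(r_squared(y_true, baseline))
```

An RMSE of roughly 8.7 means a typical prediction misses the observed signal-to-noise ratio by several dB. That is consistent with the low $r^2$ score.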



#### Part I \ 
\
\
What is the predicted signal to noise value for a receiver that is located 2,000 km
from a transmitter that uses a frequency of 14,030,000 Hz and a power level of 31
dBm? How confident are you in the answer? Explain your reasoning. 

In [None]:
# 1.i

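# The signs follow the coefficients printed in Part G;
# the frequency coefficient is positive.
prediction = -20.24474549265532 - 1.07415430*10**(-3)*2000 + \
    1.25381904*10**(-9)*14030000 + 2.98849527*10**(-1)*31
print(prediction)


The predicted signal-to-noise ratio is approximately -13.11 dB. I am not very confident in this answer: on the test data the model's $r^2$ score is only about 0.10 and the mean squared error is about 75.19, so an individual prediction can be off by several dB.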
