Import the necessary libraries and add column labels to the dataset.
Save the labeled dataset as a CSV file.
Use that CSV file as the main dataset from now on.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from zlib import crc32
from sklearn.model_selection import train_test_split


'''
# --------- read the attribute names first ---------
with open('Names.txt', 'r') as file:
    attributes = file.readlines()
# We are only interested in the attribute name itself, so discard the rest of each line.
columnNames = [line.split()[-2] for line in attributes if line.startswith('@attribute')]

# --------- read the data ---------
dataset = pd.read_csv('communities.data', header=None)
dataset.columns = columnNames
# dataset.to_csv('DatasetWithHeaders', index=False)  # Only needs to run once, so it is commented out.
'''
datasetWithHeaders = pd.read_csv('DatasetWithHeaders')  #The new dataset with labels.
print(datasetWithHeaders.head())



   state county community        communityname  fold  population  \
0      8      ?         ?         Lakewoodcity     1        0.19   
1     53      ?         ?          Tukwilacity     1        0.00   
2     24      ?         ?         Aberdeentown     1        0.00   
3     34      5     81440  Willingborotownship     1        0.04   
4     42     95      6096    Bethlehemtownship     1        0.01   

   householdsize  racepctblack  racePctWhite  racePctAsian  ...  LandArea  \
0           0.33          0.02          0.90          0.12  ...      0.12   
1           0.16          0.12          0.74          0.45  ...      0.02   
2           0.42          0.49          0.56          0.17  ...      0.01   
3           0.77          1.00          0.08          0.12  ...      0.02   
4           0.55          0.02          0.95          0.09  ...      0.04   

   PopDens  PctUsePubTrans  PolicCars  PolicOperBudg  LemasPctPolicOnPatr  \
0     0.26            0.20       0.06           0.0

Check how many columns contain missing values (denoted by "?") and compute the percentage of missing values in each of those columns.

In [22]:

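#datasetWithHeaders.hist(bins=50, figsize=(12, 8))
#plt.show()

# Match a literal "?"; regex=False stops pandas from treating it as a regex metacharacter.
columnsWithQuestionMark = [column for column in datasetWithHeaders.columns if datasetWithHeaders[column].astype(str).str.contains('?', regex=False).any()]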
columnPercentages = {}
listOfFeaturesWithMissingValue = []
# Calculate the percentage of "?" in each column and store the results
for column in columnsWithQuestionMark:
    percentQuestionMark = (datasetWithHeaders[column] == "?").mean() * 100
    columnPercentages[column] = percentQuestionMark
    listOfFeaturesWithMissingValue.append(column)
#Print the percentages for columns with "?"
for column, percentage in columnPercentages.items():
    print(f"Percentage of '?' values in column '{column}': {percentage:.2f}%")

print(listOfFeaturesWithMissingValue)



Percentage of '?' values in column 'county': 58.88%
Percentage of '?' values in column 'community': 59.03%
Percentage of '?' values in column 'OtherPerCap': 0.05%
Percentage of '?' values in column 'LemasSwornFT': 84.00%
Percentage of '?' values in column 'LemasSwFTPerPop': 84.00%
Percentage of '?' values in column 'LemasSwFTFieldOps': 84.00%
Percentage of '?' values in column 'LemasSwFTFieldPerPop': 84.00%
Percentage of '?' values in column 'LemasTotalReq': 84.00%
Percentage of '?' values in column 'LemasTotReqPerPop': 84.00%
Percentage of '?' values in column 'PolicReqPerOffic': 84.00%
Percentage of '?' values in column 'PolicPerPop': 84.00%
Percentage of '?' values in column 'RacialMatchCommPol': 84.00%
Percentage of '?' values in column 'PctPolicWhite': 84.00%
Percentage of '?' values in column 'PctPolicBlack': 84.00%
Percentage of '?' values in column 'PctPolicHisp': 84.00%
Percentage of '?' values in column 'PctPolicAsian': 84.00%
Percentage of '?' values in column 'PctPolicMinor
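
The same kind of percentages can also be computed by letting pandas parse "?" as NaN while reading the file. The sketch below is only an illustration under assumptions: it uses a tiny made-up CSV (the `csv_text` string and the name `toy` are hypothetical) instead of the real dataset.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file; "?" marks a missing value.
csv_text = (
    "county,population,OtherPerCap\n"
    "?,0.19,0.20\n"
    "5,0.00,?\n"
    "?,0.04,0.30\n"
    "95,0.01,0.40\n"
)

# na_values="?" tells pandas to read every "?" as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# isna().mean() is the fraction of missing entries per column.
missing_pct = toy.isna().mean() * 100

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct.round(2))
```

A side effect of this approach is that columns such as county are read as numbers rather than strings, which may or may not be what the later steps expect.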

25 features have missing data. 22 of them are missing 84% of their values, so we will drop those features from the dataset.
For "OtherPerCap" (the per capita income of people whose ethnicity is other than the ones listed in the dataset), we will
fill the missing values with the mean.
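
The cell that follows only drops columns, so the mean fill is not shown there. A minimal sketch of how it could be done, assuming the column still holds the string "?" for missing entries; the `toy` frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up.
toy = pd.DataFrame({"OtherPerCap": ["0.20", "?", "0.40", "0.30"]})

# The "?" entries force the column to be read as strings, so convert it first.
# errors="coerce" turns anything non-numeric (here "?") into NaN.
toy["OtherPerCap"] = pd.to_numeric(toy["OtherPerCap"], errors="coerce")

# Replace NaN with the mean of the observed values.
column_mean = toy["OtherPerCap"].mean()
toy["OtherPerCap"] = toy["OtherPerCap"].fillna(column_mean)

print(toy)
```

If the fill is done after the train/test split, computing the mean on the training set only and reusing it for the test set keeps test information out of the preprocessing.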

In [23]:
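# Keep the first three entries (county, community, OtherPerCap); the remaining features are dropped.
listOfFeaturesWithMissingValue = listOfFeaturesWithMissingValue[3:]
newDataSet = datasetWithHeaders.drop(columns=listOfFeaturesWithMissingValue)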

# You can save the new dataset to a new CSV file if needed
newDataSet.to_csv('new_dataset.csv', index=False)

We have a supervised task. It is a multiple regression task, because the prediction uses several features, and a univariate regression task, because we are only trying to predict a single value. We will use batch learning.
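
To make the terminology concrete, here is a small sketch on synthetic data. It is not the model used in this notebook: `LinearRegression`, the two made-up features, and the coefficients are all assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two input features (multiple regression) ...
rng = np.random.default_rng(0)
X = rng.random((100, 2))

# ... and one target value per row (univariate regression).
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5

# Batch learning: the model is fitted once on all available training rows.
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)
```

Because the synthetic target has no noise, the fitted coefficients should come out very close to the ones used to generate it.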

We will start by setting aside 20% of the data for testing. The test rows are chosen randomly, with a fixed seed (random_state=42) so that the split is reproducible.



In [29]:


newTrainset, newTestSet = train_test_split(newDataSet, test_size=0.2, random_state=42)
print(f"Length of newTrainSet: {len(newTrainset)}") #Should be roughly 80% of the data
print(f"Length of newTestSet: {len(newTestSet)}") #Should be roughly 20% of the data
print(newTrainset.head())



Length of newTrainSet: 1595
Length of newTestSet: 399
      state county community        communityname  fold  population  \
1378     28      ?         ?          Jacksoncity     7        0.30   
1826     34     31     60090  PomptonLakesborough    10        0.00   
678      12      ?         ?            Daniacity     4        0.00   
1083     25      3     46225       NorthAdamscity     6        0.01   
1558      5      ?         ?           Bentoncity     8        0.01   

      householdsize  racepctblack  racePctWhite  racePctAsian  ...  \
1378           0.48          1.00          0.14          0.03  ...   
1826           0.45          0.01          0.96          0.09  ...   
678            0.23          0.61          0.50          0.03  ...   
1083           0.38          0.03          0.97          0.03  ...   
1558           0.41          0.08          0.93          0.01  ...   

      PctForeignBorn  PctBornSameState  PctSameHouse85  PctSameCity85  \
1378            0.03     