# Outline
- [1 - Importing the required Libraries and Packages](#1)
- [2 - Checking the Null Values](#2)
    - [ 2.1 Replacing the null value](#2.1)
- [3 - Dropping Unnecessary Columns](#3)
- [4 - Standardizing and Consolidating Target Variable 'DAMAGE'](#4)
  - [4.1 Saving File to avoid re-cleaning the data](#4.1)
- [5 - Using Google Maps API to fill missing data](#5)
  - [5.1 Saving File to avoid rerunning API](#5.1)
- [6 - Reading the Address filled Saved Data](#6)
  - [6.1 - Checking the percentage for each DAMAGE Class](#6.1)
- [7 - Correcting misspelled community names and providing accurate names](#7)
  - [7.1 - Mapping Community](#7.1)
  - [7.2 - Saving File](#7.2)
- [8 - Filling the Communities column based on the nearest neighbor](#8)
  - [8.1 Saving File](#8.1)
- [9 - Filling the Topography Column based on nearest neighbour within the same CITY, COUNTY and COMMUNITY](#9)
  - [9.1 Saving File](#9.1)
- [10 - Checking for Unknown Values in the HOUSE COMPONENTS](#10)
  - [10.1 Dropping Rows with no Coordinates](#10.1)
  - [10.2 Removing rows with Unknown in all House component attributes](#10.2)
  - [10.3 Replacing repetitive values in house attributes with the correct ones (EAVES, WINDOWPANE)](#10.3)
  - [10.4 Mapping VegClearan](#10.4)
  - [10.5 Dropping rows based on INCIDENTST](#10.5)
  - [10.6 Saving File](#10.6)
  - [10.7 Mapping, Standardizing and Consolidating Housing StructureType](#10.7)
- [11 - Saving the Final Cleaned Version File](#11)

<a name="1"></a>
## 1 - Importing the required Libraries and Packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [346]:
# Disable scientific notation for pandas
# pd.options.display.float_format = '{:.2f}'.format
cali_wildfire_og_df = pd.read_csv('combined_structure_status.csv')
df = pd.read_csv('combined_structure_status.csv')
cali_wildfire_og_df

Unnamed: 0,DAMAGE,STREETNUMB,STREETNAME,STREETTYPE,STREETSUFF,CITY,STATE,CALFIREUNI,COUNTY,COMMUNITY,...,TOPOGRAPHY,APN,ASSESSEDIM,YEARBUILT,SITEADDRES,GLOBALID,GEOMETRY,LONGITUDE,LATITUDE,CLAIM
0,Destroyed (>50%),1480,Patrick,Drive,154,Paradise Northwest A,CA,BTU,Butte,Paradise,...,Flat Ground,050-150-111-000,2756731.0,1900.0,1400 KILCREASE CIR PARADISE CA 95969,{44673E14-1CA5-4AE5-BAD2-FBB23CEE6E17},POINT (-121.5905677572311 39.7826496849055),-121.590568,39.782650,629141.49
1,Destroyed (>50%),1480,Patrick,Drive,156,Paradise Northwest A,CA,BTU,Butte,Paradise,...,Flat Ground,050-150-111-000,2756731.0,1900.0,1400 KILCREASE CIR PARADISE CA 95969,{0D5163EE-231B-43C6-8B61-FF7301765CAE},POINT (-121.5903517061171 39.78267012986168),-121.590352,39.782670,812980.07
2,Destroyed (>50%),1480,Patrick,Drive,158,Paradise Northwest A,CA,BTU,Butte,Paradise,...,Flat Ground,050-150-111-000,2756731.0,1900.0,1400 KILCREASE CIR PARADISE CA 95969,{316A2D22-D5CE-4FFD-BE9A-A39485DD7FC3},POINT (-121.5901607359658 39.78265522691407),-121.590161,39.782655,349805.49
3,Destroyed (>50%),1480,Patrick,Drive,160,Paradise Northwest A,CA,BTU,Butte,Paradise,...,Flat Ground,050-150-111-000,2756731.0,1900.0,1400 KILCREASE CIR PARADISE CA 95969,{F82B4C07-4405-48CC-AFF3-2305FD6AE820},POINT (-121.5899248446604 39.78265953460353),-121.589925,39.782660,593805.73
4,Affected (1-9%),1480,Patrick,Drive,162,Paradise Northwest A,CA,BTU,Butte,Paradise,...,Flat Ground,050-150-111-000,2756731.0,1900.0,1400 KILCREASE CIR PARADISE CA 95969,{45F2AE98-8578-436E-A18B-C9739D38CC00},POINT (-121.5899292697615 39.78285534848954),-121.589929,39.782855,22349.39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31679,Destroyed,5547,Wolf,Trail,,,CA,BTU,YUB,Loma Rica,...,Ridge Top,040130002000,20679.0,0.0,5547 WOLF TRL LOMA RICA CA 95901,{69C72D5F-FD50-4C70-954F-89F48076CFA2},POINT (-121.4056712799537 39.33035558970509),-121.405671,39.330356,309425.31
31680,Destroyed,5547,Wolf,Trail,,,CA,BTU,YUB,Loma Rica,...,Ridge Top,040130002000,20679.0,0.0,5547 WOLF TRL LOMA RICA CA 95901,{3FD66424-FB70-411B-BE0B-B0CCD3CDA18D},POINT (-121.4058839695719 39.3303358601046),-121.405884,39.330336,535016.99
31681,Destroyed,5614,Wolf,Trail,,,CA,BTU,YUB,Loma Rica,...,Slope,040130008000,35000.0,1980.0,5614 WOLF TRL LOMA RICA CA 95901,{F7ED329B-F83A-4DBE-B7AC-BAB5BE6F0EBA},POINT (-121.4069208293274 39.32686790895899),-121.406921,39.326868,474123.57
31682,Destroyed,15402,Vierra,,,,CA,BTU,YUB,,...,,056220018000,30939.0,0.0,15402 VIERRA RD RACKERBY CA 95972,{E024929E-428F-45A7-8C0C-4C547A532DB7},POINT (-121.3578086131194 39.40689924995728),0.000000,0.000000,464593.54


<a name="2"></a>
## 2 - Checking the null values

In [97]:
column_analysis_df = pd.DataFrame({
    'Data Type': cali_wildfire_og_df.dtypes.astype(str),
    'Unique Values': cali_wildfire_og_df.nunique(),
    'Missing Values': cali_wildfire_og_df.isnull().sum()
}).reset_index().rename(columns={'index': 'Column'})

column_analysis_df


Unnamed: 0,Column,Data Type,Unique Values,Missing Values
0,DAMAGE,object,10,0
1,STREETNUMB,object,7798,236
2,STREETNAME,object,2248,50
3,STREETTYPE,object,17,1113
4,STREETSUFF,object,838,18208
5,CITY,object,27,6640
6,STATE,object,2,4
7,CALFIREUNI,object,5,0
8,COUNTY,object,11,0
9,COMMUNITY,object,179,3302


<a name="2.1"></a>
### 2.1 - Replacing null values with '`Unknown`' to simplify the data cleaning steps

In [98]:
# Replace blanks, '-', None, and case variants of 'Unknown' with 'Unknown' in every column
for column in cali_wildfire_og_df.columns:
    if cali_wildfire_og_df[column].dtype == 'O':  # For object type columns
        # cali_fireregion_data.fillna({column: 'Unknown'}, inplace=True)
        cali_wildfire_og_df[column] = cali_wildfire_og_df[column].replace(['-', '', ' ', 'Unknown', 'UnKnown', 'unknown', 'UNKNOWN', None], 'Unknown')
    else:  # For numeric columns, replace NaN with the string 'Unknown' (the column becomes object dtype)
        cali_wildfire_og_df[column] = cali_wildfire_og_df[column].apply(lambda x: 'Unknown' if pd.isna(x) else x)

# Verify that there are no missing values left
cali_wildfire_og_df.isna().sum()

DAMAGE        0
STREETNUMB    0
STREETNAME    0
STREETTYPE    0
STREETSUFF    0
CITY          0
STATE         0
CALFIREUNI    0
COUNTY        0
COMMUNITY     0
INCIDENTNA    0
INCIDENTNU    0
INCIDENTST    0
HAZARDTYPE    0
VEGCLEARAN    0
STRUCTURET    0
ROOFCONSTR    0
EAVES         0
VENTSCREEN    0
EXTERIORSI    0
WINDOWPANE    0
TOPOGRAPHY    0
APN           0
ASSESSEDIM    0
YEARBUILT     0
SITEADDRES    0
GLOBALID      0
GEOMETRY      0
LONGITUDE     0
LATITUDE      0
CLAIM         0
dtype: int64
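The numeric branch of the loop above has a side effect that is easy to miss: once a float column holds the string `'Unknown'`, pandas stores it as `object`. The sketch below illustrates this on a made-up `YEARBUILT` series (not the wildfire data) and shows one possible way to get numbers back later with `pd.to_numeric`; the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the dataset
year_built = pd.Series([1965.0, np.nan, 1980.0], name="YEARBUILT")

# Same idea as the numeric branch of the cleaning loop
filled = year_built.apply(lambda x: "Unknown" if pd.isna(x) else x)

dtype_before = str(year_built.dtype)
dtype_after = str(filled.dtype)

# If numeric operations are needed later, the placeholder can be coerced back to NaN
recovered = pd.to_numeric(filled, errors="coerce")

print(dtype_before, dtype_after, recovered.dtype)
```

Keeping this in mind helps when a later step computes statistics on columns such as `YEARBUILT` or `ASSESSEDIM`.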

<a name="3"></a>
## 3 - Dropping the irrelevant columns

In [99]:
# Drop the irrelevant columns
columns_to_drop = ['STREETNUMB', 'STREETSUFF', 'STATE', 'HAZARDTYPE', 'APN', 'SITEADDRES']
cali_wildfire_dropped_columns = cali_wildfire_og_df.drop(columns=columns_to_drop)

In [100]:
cali_wildfire_dropped_columns['DAMAGE'].unique()

array(['Destroyed (>50%)', 'Affected (1-9%)', 'Major (26-50%)',
       'Minor (10-25%)', '26-50%', 'Destroyed', '1-9%', 'No Damage',
       '10-25%', '51-75%'], dtype=object)

<a name="4"></a>
## 4 - Standardizing and Consolidating Target Variable 'DAMAGE'

In [101]:
# Mapping for the DAMAGE column for their respective categories to consolidate overlapping categories
# Define the mapping dictionary
damage_mapping = {
    '1-9%': 'Affected (1-9%)',
    'Affected (1-9%)': 'Affected (1-9%)',
    '10-25%': 'Minor (10-25%)',
    '26-50%': 'Major (26-50%)',
    '51-75%': 'Destroyed (>50%)',
    'Destroyed': 'Destroyed (>50%)',
    'Destroyed (>50%)': 'Destroyed (>50%)',
    'No Damage': 'No Damage'
}

# Apply the mapping to the DAMAGE column
cali_wildfire_dropped_columns['DAMAGE'] = cali_wildfire_dropped_columns['DAMAGE'].replace(damage_mapping)

In [102]:
cali_wildfire_dropped_columns['DAMAGE'].value_counts()

DAMAGE
Destroyed (>50%)    29383
Affected (1-9%)      1256
No Damage             575
Minor (10-25%)        310
Major (26-50%)        160
Name: count, dtype: int64
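Because `replace` leaves any label that is not a key of the mapping untouched, it can be useful to check for labels that slipped through. The following is a minimal sketch on a hypothetical sample (the `'Burned'` label is invented for the example), using a shortened copy of the mapping above.

```python
import pandas as pd

# Shortened copy of the mapping, without the identity entries
damage_mapping = {
    "1-9%": "Affected (1-9%)",
    "10-25%": "Minor (10-25%)",
    "26-50%": "Major (26-50%)",
    "51-75%": "Destroyed (>50%)",
    "Destroyed": "Destroyed (>50%)",
}

# Hypothetical raw labels; 'Burned' is invented to show an unmapped value
raw = pd.Series(["1-9%", "Destroyed", "Major (26-50%)", "51-75%", "Burned"])
consolidated = raw.replace(damage_mapping)

# Labels we expect to see after consolidation
expected_labels = set(damage_mapping.values()) | {"No Damage"}
unexpected = sorted(set(consolidated) - expected_labels)

print(consolidated.tolist())
print("Unexpected labels:", unexpected)
```

An empty `unexpected` list suggests every category was consolidated into one of the five target classes.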

<a name="4.1"></a>
### 4.1 - Saving File to avoid re-cleaning the data

In [169]:
import os

csv_folder = 'csv_data'
if not os.path.exists(csv_folder):
    os.makedirs(csv_folder)

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_data.csv')

# Write the DataFrame to a CSV file
cali_wildfire_dropped_columns.to_csv(csv_file_path, index=False)
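The same save pattern recurs after most steps in this notebook. One way to avoid repeating it is a small helper; `save_csv` below is a hypothetical name, and the sketch writes to a temporary directory so it does not touch the real `csv_data` folder.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, filename, folder="csv_data"):
    """Write df to folder/filename, creating the folder if it does not exist."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder is already there
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path


# Usage on a throwaway frame inside a temporary directory
demo = pd.DataFrame({"DAMAGE": ["No Damage"], "CITY": ["Chico"]})
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_csv(demo, "demo.csv", folder=os.path.join(tmp, "csv_data"))
    round_trip = pd.read_csv(saved_path)

print(round_trip)
```

`os.makedirs(..., exist_ok=True)` replaces the explicit `os.path.exists` check used in the cells.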

In [170]:
# cali_wildfire_dropped_columns[(cali_wildfire_dropped_columns['STREETNAME'] == 'Unknown') | 
# (cali_wildfire_dropped_columns['STREETTYPE'] == 'Unknown') | 
# (cali_wildfire_dropped_columns['COUNTY'] == 'Unknown') | 
# (cali_wildfire_dropped_columns['CITY'] == 'Unknown')]

<a name="5"></a>
## 5 - Using Google MAPS API to fill unknown data
#### Filling the Address Components through Google Maps API using Latitude and Longitude

In [334]:
import googlemaps
from time import sleep

gmaps = googlemaps.Client(key='AIzaCiD_Bq2AHWfbvFwTJGH5eVkKev4WOFY86VI')  # API key has been changed

In [172]:
def extract_address_components(geocode_results):
    """Extract address components by checking multiple results"""
    components = {
        'street_name': None,
        'street_type': None,
        'city': None,
        'county': None
    }
    
    # Check if we have any results
    if not geocode_results:
        return components
    
    # Function to extract components from a single result
    def extract_from_result(address_components):
        found_components = {}
        
        for component in address_components:
            # Extract route (street name and type)
            if 'route' in component['types'] and not found_components.get('street_name'):
                route_parts = component['long_name'].split()
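                if len(route_parts) > 1:
                    found_components['street_name'] = ' '.join(route_parts[:-1])
                    found_components['street_type'] = route_parts[-1]
                elif route_parts:
                    # A one-word route has a name but no separate street type
                    found_components['street_name'] = route_parts[0]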
            
            # Extract city
            elif 'locality' in component['types'] and not found_components.get('city'):
                found_components['city'] = component['long_name']
            
            # Extract county
            elif 'administrative_area_level_2' in component['types'] and not found_components.get('county'):
                found_components['county'] = component['long_name'].replace(' County', '')
        
        return found_components
    
    # Try each result until we find all components or run out of results
    for result in geocode_results:
        if 'address_components' not in result:
            continue
            
        found = extract_from_result(result['address_components'])
        
        # Update components with any new findings
        for key, value in found.items():
            if components[key] is None and value is not None:
                components[key] = value
        
        # If we found all components, we can stop
        if all(components.values()):
            break
    
    # Fill any remaining None values with 'Unknown'
    for key in components:
        if components[key] is None:
            components[key] = 'Unknown'
    
    return components


def update_address_data(df):
    # Create mask for rows needing updates
    mask = (
        (df['STREETNAME'] == 'Unknown') |
        (df['STREETTYPE'] == 'Unknown') |
        (df['CITY'] == 'Unknown') |
        (df['COUNTY'] == 'Unknown')
    )
    
    # Get rows that need updating
    rows_to_update = df[mask].copy()
    
    # Process rows that need updating
    for idx, row in rows_to_update.iterrows():
        try:
            # Reverse geocode the coordinates
            results = gmaps.reverse_geocode((row['LATITUDE'], row['LONGITUDE']))
            components = extract_address_components(results)
            
            # Update components if they were Unknown
            if row['STREETNAME'] == 'Unknown' and components['street_name'] != 'Unknown':
                df.at[idx, 'STREETNAME'] = components['street_name']
            if row['STREETTYPE'] == 'Unknown' and components['street_type'] != 'Unknown':
                df.at[idx, 'STREETTYPE'] = components['street_type']
            if row['CITY'] == 'Unknown' and components['city'] != 'Unknown':
                df.at[idx, 'CITY'] = components['city']
            if row['COUNTY'] == 'Unknown' and components['county'] != 'Unknown':
                df.at[idx, 'COUNTY'] = components['county']
            
            # Respect API rate limits
            sleep(0.2)
            
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            continue
    
    return df
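To make the parsing logic easier to follow without calling the API, here is a minimal sketch that runs the same idea on a hand-written dictionary shaped like one entry of a reverse-geocode result (a list of `address_components`, each with `long_name` and `types`). The sample values and the helper name `parse_components` are assumptions for illustration, not output from the real service.

```python
# Hand-written sample mimicking one reverse-geocode result entry
sample_result = {
    "address_components": [
        {"long_name": "Wolf Trail", "types": ["route"]},
        {"long_name": "Loma Rica", "types": ["locality", "political"]},
        {"long_name": "Yuba County",
         "types": ["administrative_area_level_2", "political"]},
    ]
}


def parse_components(address_components):
    """Pull street name/type, city and county out of a component list."""
    parsed = {"street_name": None, "street_type": None, "city": None, "county": None}
    for component in address_components:
        types = component["types"]
        if "route" in types:
            parts = component["long_name"].split()
            if len(parts) > 1:
                # Last word is treated as the street type, the rest as the name
                parsed["street_name"] = " ".join(parts[:-1])
                parsed["street_type"] = parts[-1]
            elif parts:
                parsed["street_name"] = parts[0]
        elif "locality" in types:
            parsed["city"] = component["long_name"]
        elif "administrative_area_level_2" in types:
            parsed["county"] = component["long_name"].replace(" County", "")
    return parsed


parsed = parse_components(sample_result["address_components"])
print(parsed)
```

Testing the parser on fixed samples like this can catch mistakes before spending API quota on the full run.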


In [None]:
%%time
cali_wildfire_filled_address_df = update_address_data(cali_wildfire_dropped_columns)

In [336]:
cali_wildfire_filled_address_df[((cali_wildfire_filled_address_df['STREETNAME'] == 'Unknown') | 
(cali_wildfire_filled_address_df['STREETTYPE'] == 'Unknown') | 
(cali_wildfire_filled_address_df['COUNTY'] == 'Unknown') | 
(cali_wildfire_filled_address_df['CITY'] == 'Unknown')) & (cali_wildfire_filled_address_df['DAMAGE'] == 'Destroyed (>50%)')]

<a name="5.1"></a>
### 5.1 - Saving the File to avoid rerunning the Google Maps API (takes 1.5 hours)

In [None]:
import os

csv_folder = 'csv_data'
if not os.path.exists(csv_folder):
    os.makedirs(csv_folder)

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df.to_csv(csv_file_path, index=False)

In [None]:
cali_wildfire_filled_address_df

In [None]:
cali_wildfire_filled_address_df[cali_wildfire_filled_address_df['CITY'] == 'Unknown']

In [None]:
# Drop the irrelevant columns
columns_to_drop = ['STREETNAME', 'STREETTYPE', 'GLOBALID', 'GEOMETRY']
cali_wildfire_filled_address_df = cali_wildfire_filled_address_df.drop(columns=columns_to_drop)

In [None]:
import os

csv_folder = 'csv_data'
if not os.path.exists(csv_folder):
    os.makedirs(csv_folder)

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_drop_columns.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df.to_csv(csv_file_path, index=False)

<a name="6"></a>
## 6 - Reading the Address filled Saved Data

In [5]:
cali_wildfire_filled_address_df = pd.read_csv('csv_data/cali_wildfire_filled_address_df_drop_columns.csv')

In [6]:
cali_wildfire_filled_address_df.groupby(['DAMAGE'], as_index=False).agg(
    CITY_Count = ('CITY', 'count')
)

Unnamed: 0,DAMAGE,CITY_Count
0,Affected (1-9%),1256
1,Destroyed (>50%),29381
2,Major (26-50%),160
3,Minor (10-25%),310
4,No Damage,575


<a name="6.1"></a>
### 6.1 - Checking the percentage for each DAMAGE Class

In [7]:
# First, group by 'DAMAGE' and calculate CITY_Count
damage_grouped = cali_wildfire_filled_address_df.groupby(['DAMAGE'], as_index=False).agg(
    CITY_Count=('CITY', 'count')
)

# Now calculate CITY_Percentage (stored as a fraction of all rows, not on a 0-100 scale)
damage_grouped['CITY_Percentage'] = damage_grouped['CITY_Count'] / len(cali_wildfire_filled_address_df)

# View the result
damage_grouped


Unnamed: 0,DAMAGE,CITY_Count,CITY_Percentage
0,Affected (1-9%),1256,0.039641
1,Destroyed (>50%),29381,0.927313
2,Major (26-50%),160,0.00505
3,Minor (10-25%),310,0.009784
4,No Damage,575,0.018148
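The 'Destroyed (>50%)' count here (29381) is slightly lower than in section 4 (29383), and the fractions do not sum exactly to 1. One possible explanation is that `count` on `CITY` skips rows where `CITY` is missing, while `len` counts every row. The sketch below shows the difference on a made-up frame and uses `size` and `value_counts(normalize=True)` as alternatives; it is an illustration, not a diagnosis of the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing CITY
toy = pd.DataFrame({
    "DAMAGE": ["Destroyed (>50%)", "Destroyed (>50%)", "No Damage", "No Damage"],
    "CITY": ["Paradise", np.nan, "Chico", "Chico"],
})

by_count = toy.groupby("DAMAGE")["CITY"].count()  # ignores rows with missing CITY
by_size = toy.groupby("DAMAGE").size()            # counts every row in the group

# Class shares computed directly from the target column
shares = toy["DAMAGE"].value_counts(normalize=True)

print(by_count.to_dict())
print(by_size.to_dict())
print(shares.to_dict())
```

Computing class shares from `DAMAGE` itself avoids depending on whether another column happens to be populated.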


In [8]:
# cali_wildfire_filled_address_df['CITY'].unique()

In [9]:
cali_wildfire_filled_address_df[cali_wildfire_filled_address_df['TOPOGRAPHY'] == 'Ridge Top'].head(1)

Unnamed: 0,DAMAGE,CITY,CALFIREUNI,COUNTY,COMMUNITY,INCIDENTNA,INCIDENTNU,INCIDENTST,VEGCLEARAN,STRUCTURET,...,EAVES,VENTSCREEN,EXTERIORSI,WINDOWPANE,TOPOGRAPHY,ASSESSEDIM,YEARBUILT,LONGITUDE,LATITUDE,CLAIM
198,Destroyed (>50%),Paradise Southeast A,BTU,Butte,Paradise,Camp,CA-BTU-016737,11/8/2018,>100',Single Family Residence Single Story,...,Enclosed,"Mesh Screen > 1/8""",Combustible,Multi Pane,Ridge Top,193052.0,1965.0,-121.585719,39.769048,694137.5


In [12]:
cali_wildfire_filled_address_df[(cali_wildfire_filled_address_df['CITY'] == 'Calistoga')][['CITY', 'COUNTY', 'COMMUNITY']].value_counts()

CITY       COUNTY  COMMUNITY   
Calistoga  SON     Franz Valley    223
           NAP     Calistoga       106
Name: count, dtype: int64

<a name="7"></a>
## 7 - Correcting misspelled community names and providing accurate names.

In [311]:
cali_wildfire_filled_address_df['COMMUNITY'].unique()

array(['Paradise', 'Magalia', 'Oroville', 'Concow', 'Unknown',
       'Paradise Treatment Cente', 'Skyway Antique Mall', 'P',
       'Paradise Adventist Academy', 'A', 'Villa Monterey Apartments',
       'Chico', 'Camino Apartments', 'dfg', 'Butte Valley', 'Yankee Hill',
       'Forest Ranch', 'Honey Run', 'Shadowbrook Apartments',
       'Academy Oaks Apartments', 'Los Angeles County', 'Malibu',
       'Malibu Springs', 'Charmlee Wilderness Park', 'Agoura Hills',
       'Los Virgenes Homeowners Association', 'Calabasas',
       'South Ridge Heights', 'Oak Forest', 'Oak Village', 'Oak',
       'Seminole Springs Mobile Park', 'Seminole Springs',
       'Hidden Park Estates', 'Agoura hills', 'Malibu Lake',
       'Saratoga Hills', "Brent's Junction", 'Point Dume Club of Malibu',
       'Oak Park', 'Residential', 'Westlake', 'Morrison Sutton Valley',
       'Bell Canyon', 'Thousand Oaks', 'Wilshire Boulevard Temple Camps',
       'Vallecito Mobile Community', 'High Knoll', 'Camarillo',
  

<a name="7.1"></a>
### 7.1 - Mapping community with correct names.

In [14]:
# Create a mapping dictionary for corrections
community_mapping = {
    'paradise': 'Paradise', 'PARADISE': 'Paradise', 'Para': 'Paradise', 'Paridise': 'Paradise', 'Pasadise': 'Paradise',
    'Paradisr': 'Paradise', 'Pradadise': 'Paradise', 'Paradie': 'Paradise', 'Laradise': 'Paradise',
    
    'magalia': 'Magalia', 'Magelia': 'Magalia',
    
    'Oraville': 'Oroville', 'Or': 'Oroville', 'Orville': 'Oroville', 'Oriville': 'Oroville', 'Orovolle': 'Oroville',

    'Paradise Treatment Ctr': 'Paradise Treatment Cente',

    'Villa Monterey Apts.': 'Villa Monterey Apartments',

    'Camino Apts': 'Camino Apartments',
    
    'Butted Valley': 'Butte Valley',

    'Shadowbrook Apts.': 'Shadowbrook Apartments', 
    
    'Academy Oaks Apt': 'Academy Oaks Apartments', 'Academy Oaks Apt.': 'Academy Oaks Apartments',
    
    'malibu': 'Malibu', 'Mal': 'Malibu', 'mal': 'Malibu', 'mal6': 'Malibu',

    'Agoura Hills': 'Agoura Hills', 'Ca 91301': 'Agoura Hills', 
    'Agoura Hills, Ca 91301': 'Agoura Hills',  'Angora Hills': 'Agoura Hills',
    'Agoura Hills Ca': 'Agoura Hills', 'Agora':'Agoura Hills', 'Agoura': 'Agoura Hills', 'Augora Hills': 'Agoura Hills',

    'Seminole Springs mobile home park': 'Seminole Springs Mobile Park',
    'Seminole Springs Mobile Home Park': 'Seminole Springs Mobile Park',
    'Seminole Springs mobile home': 'Seminole Springs Mobile Park', 
    'Seminole Springs home park': 'Seminole Springs Mobile Park',

    'Los Virgines Homeowners Association': 'Los Virgenes Homeowners Association',
    
    'South Ridge': 'South Ridge Heights', 'South Ridge Heights': 'South Ridge Heights',
    
    'napa': 'Napa', 'Naoa': 'Napa', 'Nspa': 'Napa',

    'Malibu Lake': 'Malibu Lake', 'Malibou Lakes': 'Malibu Lake', 'Malibou Lake': 'Malibu Lake', 
    'Malibu Lakes': 'Malibu Lake', 'Lake Malibu': 'Malibu Lake',

    'Bell Canyon': 'Bell Canyon', 'Bell canyon': 'Bell Canyon',

    'Wiltshire Boulevard Temple Camps': 'Wilshire Boulevard Temple Camps', 
    'Wilshire Boulevard Temple Camps': 'Wilshire Boulevard Temple Camps', 
    'Wilshire Boulevard Temple Camp': 'Wilshire Boulevard Temple Camps',

    'Vallecito Mobile Community':'Vallecito Mobile Community', 
    'Vallecito Community': 'Vallecito Mobile Community',

    'Ventura County': 'Ventura County',

    'Leo Carrillo Campground':'Leo Carrillo State Beach',
    
    'Leo Carrillo': 'Leo Carrillo State Beach',

    'Trancas Vinyard': 'Trancas Vineyard',

    'Zuma': 'Zuma Beach', 'Zuma Beach': 'Zuma Beach',

    'Silverado Country Club': 'Silverado Country Club', 'Silverado Country': 'Silverado Country Club',

    'La Cresenda': 'La Crescenta',
    
    'glen ellen': 'Glen Ellen', 'Glenn Ellen': 'Glen Ellen',
    
    'bangor': 'Bangor',
    
    'Somoma': 'Sonoma', 'Sinoma': 'Sonoma',

    'Capel': 'Capp Valley', 'Capel Valley': 'Capp Valley',

    'Myacamas': 'Mayacamas', 'Mayacamas':'Mayacamas',
    
    'cherokee': 'Cherokee',

    'Bangor': 'Bangor',
    
    'Rough  and  Ready': 'Rough and Ready', 'Rough & Ready': 'Rough and Ready',

    'Goose point': 'Goose Point', 'Goose Point': 'Goose Point',
    
    'Holiday Island Mobile home park': 'Holiday Island Mobile Home Park',
    
    'Redwood valley': 'Redwood Valley', 'Redwoof Valley': 'Redwood Valley',
    'Redwood valley/ Kickapoo lane': 'Redwood Valley',
    'Redwood valley/ Kickapoo ln': 'Redwood Valley',
    'Redwood valley/ Kickapoo Ln': 'Redwood Valley',
    'Redwood valley /Kickapoo ln': 'Redwood Valley',
    'Redwood valley/Kickapoo Ln': 'Redwood Valley',
    'Redwood valley/Kickapoo Lane': 'Redwood Valley',
    'Redwood  Valley': 'Redwood Valley',
    
    'loma rica': 'Loma Rica', 'Lima Rica': 'Loma Rica', 'Lima rica': 'Loma Rica',
    
    'Ventura  County': 'Ventura County'
}

In [15]:
# Apply the mapping to correct community names
cali_wildfire_filled_address_df['COMMUNITY'] = cali_wildfire_filled_address_df['COMMUNITY'].replace(community_mapping)

# Identify remaining communities that need verification
needs_verification = ['P', 'A', 'dfg', 'Unknown', 'D', 'Dry Gulch/Malibu Saloon']

# Print communities that still need verification
print("Communities that need further verification:")
for community in needs_verification:
    print(community)


# Print unique community names after correction
print("\nUnique community names after correction:")
print(cali_wildfire_filled_address_df['COMMUNITY'].unique())

Communities that need further verification:
P
A
dfg
Unknown
D
Dry Gulch/Malibu Saloon

Unique community names after correction:
['Paradise' 'Magalia' 'Oroville' 'Concow' 'Unknown'
 'Paradise Treatment Cente' 'Skyway Antique Mall' 'P'
 'Paradise Adventist Academy' 'A' 'Villa Monterey Apartments' 'Chico'
 'Camino Apartments' 'dfg' 'Butte Valley' 'Yankee Hill' 'Forest Ranch'
 'Honey Run' 'Shadowbrook Apartments' 'Academy Oaks Apartments'
 'Los Angeles County' 'Malibu' 'Malibu Springs' 'Charmlee Wilderness Park'
 'Agoura Hills' 'Los Virgenes Homeowners Association' 'Calabasas'
 'South Ridge Heights' 'Oak Forest' 'Oak Village' 'Oak'
 'Seminole Springs Mobile Park' 'Seminole Springs' 'Hidden Park Estates'
 'Agoura hills' 'Malibu Lake' 'Saratoga Hills' "Brent's Junction"
 'Point Dume Club of Malibu' 'Oak Park' 'Residential' 'Westlake'
 'Morrison Sutton Valley' 'Bell Canyon' 'Thousand Oaks'
 'Wilshire Boulevard Temple Camps' 'Vallecito Mobile Community'
 'High Knoll' 'Camarillo' 'Ventura Count
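The output above still contains both 'Agoura Hills' and 'Agoura hills', so some variants survive a hand-written mapping. A minimal sketch of one way to surface such leftovers is to group names by a normalized key (lowercased, whitespace collapsed). The sample list and the `normalize` helper are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical sample of community names
names = ["Agoura Hills", "Agoura hills", "Ventura  County", "Ventura County", "Malibu"]


def normalize(name):
    """Collapse repeated whitespace and ignore case."""
    return " ".join(name.split()).casefold()


variants = defaultdict(set)
for name in names:
    variants[normalize(name)].add(name)

# Keep only keys that map to more than one spelling
collisions = {key: sorted(group) for key, group in variants.items() if len(group) > 1}

for key, group in collisions.items():
    print(key, "->", group)
```

Each reported group is a candidate for a new entry in `community_mapping`; the choice of canonical spelling still needs a human decision.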

<a name="7.2"></a>
### 7.2 - Saving the File.

In [16]:
import os

csv_folder = 'csv_data'
if not os.path.exists(csv_folder):
    os.makedirs(csv_folder)

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_with_community.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df.to_csv(csv_file_path, index=False)

<a name="8"></a>
## 8 - Filling the Communities column based on the nearest neighbor within the same CITY and COUNTY

In [338]:
from sklearn.neighbors import NearestNeighbors

def scale_coordinates(coords):
    """Scale [latitude, longitude] degrees to approximate kilometers"""
    # ~111 km per degree of latitude; ~85 km per degree of longitude near 40°N
    return coords * np.array([111, 85])


def assign_communities(df):
    """
    Assigns communities to rows with 'Unknown' in the COMMUNITY column
    based on the nearest known community within the same CITY and COUNTY.

    Args:
        df (pd.DataFrame): DataFrame with columns ['LATITUDE', 'LONGITUDE', 'CITY', 'COUNTY', 'COMMUNITY'].

    Returns:
        pd.DataFrame: Updated DataFrame with 'Unknown' communities filled.
    """
    # Ensure we don't modify the original DataFrame
    df = df.copy()

    # Group the dataset by CITY and COUNTY
    # Process groups by size (larger groups first)
    groups = [(city, county, group) for (city, county), group 
              in df.groupby(['CITY', 'COUNTY'])]
    groups.sort(key=lambda x: len(x[2]), reverse=True)

    for city, county, group in groups:
        try:
            # Skip groups with insufficient data
            if len(group) < 2:
                continue

            # Identify known communities
            known_mask = group['COMMUNITY'] != 'Unknown'
            if not known_mask.any():  # Skip if no known communities
                continue

            known_coords = group[known_mask][['LATITUDE', 'LONGITUDE']].values
            known_coords_scaled = scale_coordinates(known_coords)
            known_communities = group[known_mask]['COMMUNITY'].values

            # Initialize Nearest Neighbors model
            nn = NearestNeighbors(n_neighbors=1)
            nn.fit(known_coords_scaled)

            # Identify unknown communities
            unknown_mask = group['COMMUNITY'] == 'Unknown'
            unknown_coords = group[unknown_mask][['LATITUDE', 'LONGITUDE']].values
            unknown_coords_scaled = scale_coordinates(unknown_coords)

            # Assign communities to unknowns
            if len(unknown_coords) > 0:
                # Find nearest neighbors
                distances, indices = nn.kneighbors(unknown_coords_scaled)

                # Assign nearest community to each unknown row
                for i, idx in enumerate(indices):
                    unknown_index = group[unknown_mask].index[i]
                    df.loc[unknown_index, 'COMMUNITY'] = known_communities[idx[0]]

        except Exception as e:
            print(f"Error processing group {city}, {county}: {str(e)}")
            continue

    return df

In [18]:
%%time
cali_wildfire_filled_address_df_with_community = assign_communities(cali_wildfire_filled_address_df)

CPU times: total: 4.3 s
Wall time: 4.46 s
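The factors 111 and 85 in `scale_coordinates` can be sanity-checked with the usual approximation that one degree of longitude spans about `111.32 * cos(latitude)` km. The sketch below evaluates it at two latitudes; the constant and the function name `km_per_degree_lon` are assumptions of this example.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough value, assumed constant for this sketch


def km_per_degree_lon(latitude_deg):
    """Approximate east-west length of one degree of longitude at a latitude."""
    return KM_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg))


northern = km_per_degree_lon(40.0)  # roughly the latitude of Butte County
southern = km_per_degree_lon(34.0)  # roughly the latitude of Malibu

print(round(northern, 1), round(southern, 1))
```

The value near 40°N matches the 85 used above, while it is somewhat larger further south. Since neighbors are only searched within one CITY and COUNTY group, the effect of a single fixed factor is likely small.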


<a name="8.1"></a>
### 8.1 - Saving the File.

In [19]:
import os

csv_folder = 'csv_data'
if not os.path.exists(csv_folder):
    os.makedirs(csv_folder)

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_with_community.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df_with_community.to_csv(csv_file_path, index=False)

In [20]:
# cali_wildfire_filled_address_df_with_community[cali_wildfire_filled_address_df_with_community['COMMUNITY'] == 'Unknown']

<a name="9"></a>
## 9 - Filling the Topography Column based on nearest neighbour within the same CITY, COUNTY and COMMUNITY

In [21]:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np

def assign_topography(df):
    """
    Assigns topography to rows with 'Unknown' based on nearest known topography
    within the same CITY, COUNTY, and COMMUNITY group.
    """
    df = df.copy()
    
    # Group by CITY, COUNTY, and COMMUNITY
    for (city, county, community), group in df.groupby(['CITY', 'COUNTY', 'COMMUNITY']):
        # Identify known topography points
        known_mask = group['TOPOGRAPHY'] != 'Unknown'
        known_coords = group[known_mask][['LATITUDE', 'LONGITUDE']].values
        known_topography = group[known_mask]['TOPOGRAPHY'].values
        
        # Only proceed if we have known topography in this group
        if len(known_coords) > 0:
            # Initialize Nearest Neighbors model
            nn = NearestNeighbors(n_neighbors=1)
            nn.fit(known_coords)
            
            # Identify unknown topography points
            unknown_mask = group['TOPOGRAPHY'] == 'Unknown'
            unknown_coords = group[unknown_mask][['LATITUDE', 'LONGITUDE']].values
            
            if len(unknown_coords) > 0:
                # Find nearest neighbors
                distances, indices = nn.kneighbors(unknown_coords)
                
                # Assign topography based on nearest neighbor
                for i, idx in enumerate(indices):
                    unknown_index = group[unknown_mask].index[i]
                    df.loc[unknown_index, 'TOPOGRAPHY'] = known_topography[idx[0]]
    
    return df
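Unlike `assign_communities`, this function fits the neighbor search on raw degrees rather than scaled kilometers. The sketch below shows, on invented points, that the two choices can select different neighbors. It uses a plain NumPy distance computation as a stand-in for `NearestNeighbors`, and `nearest_label` is a hypothetical helper.

```python
import numpy as np


def nearest_label(query, known_coords, known_labels, scale=(1.0, 1.0)):
    """Return the label of the known point closest to query after scaling."""
    scaled_known = np.asarray(known_coords) * np.asarray(scale)
    scaled_query = np.asarray(query) * np.asarray(scale)
    distances = np.linalg.norm(scaled_known - scaled_query, axis=1)
    return known_labels[int(np.argmin(distances))]


# Invented [latitude, longitude] points
known_coords = [(39.010, -121.000), (39.000, -121.012)]
known_labels = ["Ridge Top", "Flat Ground"]
query = (39.000, -121.000)

in_degrees = nearest_label(query, known_coords, known_labels)
in_km = nearest_label(query, known_coords, known_labels, scale=(111, 85))

print(in_degrees, in_km)
```

In degrees the first point is closer (0.010 versus 0.012), but in kilometers the second one is (about 1.02 km versus 1.11 km). Passing coordinates through `scale_coordinates` before fitting would make the two fill steps consistent.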


In [22]:
cali_wildfire_filled_address_df_with_community[cali_wildfire_filled_address_df_with_community['TOPOGRAPHY'] == 'Unknown']

Unnamed: 0,DAMAGE,CITY,CALFIREUNI,COUNTY,COMMUNITY,INCIDENTNA,INCIDENTNU,INCIDENTST,VEGCLEARAN,STRUCTURET,...,EAVES,VENTSCREEN,EXTERIORSI,WINDOWPANE,TOPOGRAPHY,ASSESSEDIM,YEARBUILT,LONGITUDE,LATITUDE,CLAIM
77,Affected (1-9%),Magalia,BTU,Butte,Magalia,Camp,CA-BTU-016737,11/8/2018,0-30',School,...,Enclosed,"Mesh Screen <= 1/8""",Combustible,Multi Pane,Unknown,0.0,1900.0,-121.600069,39.813186,32903.92
78,Affected (1-9%),Magalia,BTU,Butte,Magalia,Camp,CA-BTU-016737,11/8/2018,0-30',School,...,Enclosed,"Mesh Screen <= 1/8""",Combustible,Multi Pane,Unknown,0.0,1900.0,-121.599927,39.813254,38085.36
349,Destroyed (>50%),Paradise Central Southwest A,BTU,Butte,Paradise,Camp,CA-BTU-016737,11/8/2018,0-30',UtilityMiscStructure,...,Unknown,Unknown,Ignition Resistant,No Windows,Unknown,18627.0,1940.0,-121.621008,39.754381,308323.81
350,Destroyed (>50%),Paradise Central Southwest A,BTU,Butte,Paradise,Camp,CA-BTU-016737,11/8/2018,0-30',Single Family Residence Single Story,...,Unknown,"Mesh Screen > 1/8""",Ignition Resistant,Multi Pane,Unknown,18627.0,1940.0,-121.621306,39.754480,849875.49
353,Destroyed (>50%),Paradise Central Southwest A,BTU,Butte,Paradise,Camp,CA-BTU-016737,11/8/2018,0-30',UtilityMiscStructure,...,Unknown,Unknown,Combustible,Unknown,Unknown,41552.0,1948.0,-121.621219,39.755313,695554.63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31641,Destroyed (>50%),Marysville,BTU,YUB,Loma Rica,Cascade,CANEU026269,10/8/2017,Unknown,Outbuilding gt 10'X12',...,Unknown,Unknown,Unknown,Unknown,Unknown,180000.0,1979.0,-121.411826,39.317140,224037.37
31663,Destroyed (>50%),Marysville,BTU,YUB,Loma Rica,Cascade,CANEU026269,10/8/2017,Unknown,Non-habitable-Detached Garage,...,Unknown,Unknown,Unknown,Unknown,Unknown,135673.0,1989.0,-121.399754,39.335579,344971.17
31665,Destroyed (>50%),Loma Rica,BTU,YUB,Loma Rica,Cascade,CANEU026269,10/8/2017,Unknown,Outbuilding gt 10'X12',...,Unknown,Unknown,Unknown,Unknown,Unknown,74270.0,2002.0,-121.398502,39.336103,195206.49
31682,Destroyed (>50%),,BTU,YUB,Unknown,Laporte,CA-NEU-026295,10/09/2017,Unknown,Outbuilding gt 10'X12',...,Unknown,Unknown,Unknown,Unknown,Unknown,30939.0,0.0,0.000000,0.000000,464593.54


In [23]:
%%time
cali_wildfire_filled_address_df_with_community_topography = assign_topography(cali_wildfire_filled_address_df_with_community)

CPU times: total: 5.67 s
Wall time: 2.83 s


In [24]:
len(cali_wildfire_filled_address_df_with_community_topography['COMMUNITY'].unique())

108

In [25]:
len(cali_wildfire_filled_address_df_with_community['COMMUNITY'].unique())

108

In [26]:
cali_wildfire_filled_address_df_with_community_topography['COMMUNITY'].unique()

array(['Paradise', 'Magalia', 'Oroville', 'Concow',
       'Paradise Treatment Cente', 'Skyway Antique Mall', 'P',
       'Paradise Adventist Academy', 'A', 'Villa Monterey Apartments',
       'Chico', 'Camino Apartments', 'dfg', 'Butte Valley', 'Yankee Hill',
       'Forest Ranch', 'Honey Run', 'Shadowbrook Apartments',
       'Academy Oaks Apartments', 'Los Angeles County', 'Malibu',
       'Malibu Springs', 'Mulholland', 'Charmlee Wilderness Park',
       'Agoura Hills', 'Calabasas', 'Los Virgenes Homeowners Association',
       'South Ridge Heights', 'Oak Forest', 'Oak Village', 'Oak',
       'Seminole Springs Mobile Park', 'Seminole Springs',
       'Hidden Park Estates', 'Agoura hills', 'Malibu Lake',
       'Saratoga Hills', "Brent's Junction", 'Point Dume Club of Malibu',
       'Bell Canyon', 'Oak Park', 'Residential', 'Westlake',
       'Morrison Sutton Valley', 'Thousand Oaks',
       'Wilshire Boulevard Temple Camps', 'Vallecito Mobile Community',
       'High Knoll', 'Cama

In [28]:
cali_wildfire_filled_address_df_with_community_topography[(cali_wildfire_filled_address_df_with_community_topography['TOPOGRAPHY'] == 'Unknown') & 
(cali_wildfire_filled_address_df_with_community_topography['COMMUNITY'] == 'Unknown')]

Unnamed: 0,DAMAGE,CITY,CALFIREUNI,COUNTY,COMMUNITY,INCIDENTNA,INCIDENTNU,INCIDENTST,VEGCLEARAN,STRUCTURET,...,EAVES,VENTSCREEN,EXTERIORSI,WINDOWPANE,TOPOGRAPHY,ASSESSEDIM,YEARBUILT,LONGITUDE,LATITUDE,CLAIM
24479,Destroyed (>50%),Butte Valley,BTU,Butte,Unknown,Cherokee,CA-BTU-015933,10/09/2017,0-30,Non-habitable-Shop,...,Unknown,Unknown,Unknown,Unknown,Unknown,3367.0,1900.0,-121.596946,39.59779,390206.65
25302,Minor (10-25%),Ridge,MEU,Mendocino,Unknown,Redwood,CAMEU 012169,10/08/2017,0-30,Commercial Building - Single Story,...,Not Applicable,Unknown,Fire Resistant,Unknown,Unknown,0.0,0.0,-123.273173,39.322946,84012.85
31682,Destroyed (>50%),,BTU,YUB,Unknown,Laporte,CA-NEU-026295,10/09/2017,Unknown,Outbuilding gt 10'X12',...,Unknown,Unknown,Unknown,Unknown,Unknown,30939.0,0.0,0.0,0.0,464593.54


<a name="9.1"></a>
### 9.1 - Saving the Topography-Filled Data

In [31]:
import os

csv_folder = 'csv_data'
os.makedirs(csv_folder, exist_ok=True)  # create the folder only if it is missing

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_with_community_topography.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df_with_community_topography.to_csv(csv_file_path, index=False)
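
The same create-folder-then-save pattern recurs several times in this notebook. A minimal sketch of a reusable helper follows. The name `save_dataframe` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def save_dataframe(df, folder, filename):
    """Write df to folder/filename as CSV, creating the folder if needed.

    Hypothetical helper; not part of the original notebook.
    """
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path


# Toy data standing in for the real DataFrame
demo_df = pd.DataFrame({'DAMAGE': ['No Damage'], 'LATITUDE': [39.33]})
demo_folder = os.path.join(tempfile.mkdtemp(), 'csv_data')
saved_path = save_dataframe(demo_df, demo_folder, 'demo.csv')
reloaded_df = pd.read_csv(saved_path)
```

Returning the path makes it easy to reload the file right away, as the next cell does with `pd.read_csv`.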

In [112]:
cali_wildfire_filled_address_df_with_community_topography = pd.read_csv('csv_data/cali_wildfire_filled_address_df_with_community_topography.csv')

In [113]:
cali_wildfire_filled_address_df_with_community_topography.columns

Index(['DAMAGE', 'CITY', 'CALFIREUNI', 'COUNTY', 'COMMUNITY', 'INCIDENTNA',
       'INCIDENTNU', 'INCIDENTST', 'VEGCLEARAN', 'STRUCTURET', 'ROOFCONSTR',
       'EAVES', 'VENTSCREEN', 'EXTERIORSI', 'WINDOWPANE', 'TOPOGRAPHY',
       'ASSESSEDIM', 'YEARBUILT', 'LONGITUDE', 'LATITUDE', 'CLAIM'],
      dtype='object')

<a name="10"></a>
## 10 - Checking for Unknown Values in the House Components (STRUCTURET, ROOFCONSTR, EAVES, VENTSCREEN, EXTERIORSI, WINDOWPANE)

In [114]:
# Select specific columns to analyze for 'Unknown' values
columns_to_analyze = ['VEGCLEARAN', 'STRUCTURET', 'EAVES', 'VENTSCREEN', 'EXTERIORSI', 'WINDOWPANE', 'YEARBUILT']

# Count 'Unknown' values in the selected columns
unknown_counts = cali_wildfire_filled_address_df_with_community_topography[columns_to_analyze].apply(lambda col: (col == 'Unknown').sum())


# Create a new DataFrame to display the selected columns and their 'Unknown' value counts
unknown_counts_df = unknown_counts.reset_index()

unknown_counts_df.columns = ['Column', 'Unknown Value Count']
# unknown_unique_counts_df.columns = ['Column', 'Unique Values']


# Display the resulting DataFrame
print(unknown_counts_df)

       Column  Unknown Value Count
0  VEGCLEARAN                 3419
1  STRUCTURET                    2
2       EAVES                13901
3  VENTSCREEN                 9437
4  EXTERIORSI                 1885
5  WINDOWPANE                 6803
6   YEARBUILT                 7295
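
Raw counts are easier to compare when expressed as a share of all rows. The sketch below shows one possible way to do that on toy data. The column label `Unknown Share (%)` and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the real DataFrame
toy_df = pd.DataFrame({
    'EAVES': ['Unknown', 'Enclosed', 'Unknown', 'Unenclosed'],
    'WINDOWPANE': ['Single Pane', 'Unknown', 'Multi Pane', 'Multi Pane'],
})

# eq('Unknown') gives a boolean frame; the mean of each column is the Unknown share
unknown_share = toy_df.eq('Unknown').mean().mul(100).round(1)

# Reshape into a two-column table similar to the count table above
unknown_share_df = (
    unknown_share.rename('Unknown Share (%)')
    .rename_axis('Column')
    .reset_index()
)
print(unknown_share_df)
```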


<a name="10.1"></a>
### 10.1 - Removing Rows with No Coordinates

In [115]:
# Drop rows where both coordinates are zero, i.e. the location is missing
cali_wildfire_filled_address_df_with_community_topography = cali_wildfire_filled_address_df_with_community_topography[~((cali_wildfire_filled_address_df_with_community_topography['LATITUDE'] == 0) &
                                                            (cali_wildfire_filled_address_df_with_community_topography['LONGITUDE'] == 0))]
len(cali_wildfire_filled_address_df_with_community_topography)

31682
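
As a quick sanity check of the coordinate filter, the sketch below applies the same idea to toy data: a row is dropped only when both `LATITUDE` and `LONGITUDE` are zero. The sample values are made up, and the third row is a contrived edge case.

```python
import pandas as pd

# Toy data standing in for the real DataFrame
toy_df = pd.DataFrame({
    'LATITUDE': [39.335579, 0.0, 0.0],
    'LONGITUDE': [-121.399754, 0.0, -121.4],
})

# A row counts as missing only when both coordinates are zero
missing_coords = toy_df['LATITUDE'].eq(0) & toy_df['LONGITUDE'].eq(0)
with_coords_df = toy_df[~missing_coords]
print(len(with_coords_df))
```

Building the mask as a named variable first also makes it possible to inspect the dropped rows with `toy_df[missing_coords]` before discarding them.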

<a name="10.2"></a>
### 10.2 - Removing Rows with Unknown in All Checked Attributes (VEGCLEARAN, STRUCTURET, EAVES, EXTERIORSI, WINDOWPANE)

In [116]:
cali_wildfire_dropped_columns[((cali_wildfire_dropped_columns['VEGCLEARAN'] == 'Unknown') &
                               (cali_wildfire_dropped_columns['STRUCTURET'] == 'Unknown') &
                               (cali_wildfire_dropped_columns['EAVES'] == 'Unknown') &
                               (cali_wildfire_dropped_columns['EXTERIORSI'] == 'Unknown') &
                               (cali_wildfire_dropped_columns['WINDOWPANE'] == 'Unknown'))]

Unnamed: 0,DAMAGE,STREETNAME,STREETTYPE,CITY,CALFIREUNI,COUNTY,COMMUNITY,INCIDENTNA,INCIDENTNU,INCIDENTST,...,EXTERIORSI,WINDOWPANE,TOPOGRAPHY,ASSESSEDIM,YEARBUILT,GLOBALID,GEOMETRY,LONGITUDE,LATITUDE,CLAIM
21753,No Damage,Green Valley,Road,Unknown,LNU,NAP,Unknown,Atlas,CALNU010046,10/08/2017,...,Unknown,Unknown,Unknown,272599.0,Unknown,{D9E7BA90-0CC1-4DA9-A57D-ACD87E9447EE},POINT (-122.2188323333769 38.28160459645744),-122.218832,38.281605,0.0
22526,No Damage,Atlas Peak,Unknown,Unknown,LNU,NAP,Unknown,Atlas,CALNU010046,10/08/2017,...,Unknown,Unknown,Unknown,129441.0,1983.0,{901B22A8-F363-45E5-B4C9-559D53A352AA},POINT (-122.2566875155195 38.37321758053697),-122.256688,38.373218,0.0


In [119]:
cali_wildfire_filled_address_df_with_community_topography = cali_wildfire_filled_address_df_with_community_topography[~((cali_wildfire_filled_address_df_with_community_topography['VEGCLEARAN'] == 'Unknown') &
                                                            (cali_wildfire_filled_address_df_with_community_topography['STRUCTURET'] == 'Unknown') &
                                                            (cali_wildfire_filled_address_df_with_community_topography['EAVES'] == 'Unknown') &
                                                            (cali_wildfire_filled_address_df_with_community_topography['EXTERIORSI'] == 'Unknown') &
                                                            (cali_wildfire_filled_address_df_with_community_topography['WINDOWPANE'] == 'Unknown'))]
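
The chained comparisons above can be written more compactly. A minimal sketch on toy data, assuming the same five columns; the name `house_columns` and the sample rows are illustrative only:

```python
import pandas as pd

house_columns = ['VEGCLEARAN', 'STRUCTURET', 'EAVES', 'EXTERIORSI', 'WINDOWPANE']

# Toy data: the first row is Unknown everywhere, the second only partly
toy_df = pd.DataFrame({
    'VEGCLEARAN': ['Unknown', "0-30'"],
    'STRUCTURET': ['Unknown', 'School'],
    'EAVES': ['Unknown', 'Unknown'],
    'EXTERIORSI': ['Unknown', 'Combustible'],
    'WINDOWPANE': ['Unknown', 'Unknown'],
})

# True only for rows where every listed column equals 'Unknown'
all_unknown = toy_df[house_columns].eq('Unknown').all(axis=1)
kept_df = toy_df[~all_unknown]
print(len(kept_df))
```

Keeping the column list in one variable means adding or removing an attribute changes one line instead of one comparison per column.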

<a name="10.3"></a>
### 10.3 - Replacing repetitive values in house attributes with the correct ones (EAVES, WINDOWPANE)

In [139]:
cali_wildfire_filled_address_df_with_community_topography['VEGCLEARAN'].unique()

array(["0-30'", 'Unknown', "30-60'", ">100'", "60-100'"], dtype=object)

In [135]:
cali_wildfire_filled_address_df_with_community_topography['EAVES'].unique()

array(['Unknown', 'Enclosed', 'Unenclosed', 'No Eaves', 'Not Applicable'],
      dtype=object)

In [339]:
cali_wildfire_filled_address_df_with_community_topography[cali_wildfire_filled_address_df_with_community_topography['VENTSCREEN'] == 'No']

Unnamed: 0,DAMAGE,CITY,CALFIREUNI,COUNTY,COMMUNITY,INCIDENTNA,INCIDENTNU,INCIDENTST,VEGCLEARAN,STRUCTURET,...,VENTSCREEN,EXTERIORSI,WINDOWPANE,TOPOGRAPHY,ASSESSEDIM,YEARBUILT,LONGITUDE,LATITUDE,CLAIM,STRUCTURET_STANDARDIZED
21556,Minor (10-25%),Napa,LNU,NAP,Napa,Atlas,CALNU010046,10/08/2017,0-30',Single Family Residence-Multi Story,...,No,Combustible,Multi Pane,Slope,921078,1973,-122.249079,38.350261,21929.22,Single Family Residence
21562,Affected (1-9%),Napa,LNU,NAP,Napa,Atlas,CALNU010046,10/08/2017,0-30',Single Family Residence-Multi Story,...,No,Fire Resistant,Multi Pane,Slope,989868,1988,-122.250958,38.351968,56096.34,Single Family Residence
21576,No Damage,Napa,LNU,NAP,Napa,Atlas,CALNU010046,10/08/2017,0-30',Single Family Residence-Single Story,...,No,Fire Resistant,Single Pane,Slope,276501,Unknown,-122.252944,38.349008,0.00,Single Family Residence
21708,Minor (10-25%),Napa,LNU,NAP,Napa,Atlas,CALNU010046,10/08/2017,0-30',Outbuilding gt 10'X12',...,No,Combustible,Unknown,Slope,0,0,-122.255797,38.350923,21199.71,Outbuilding
21720,No Damage,Napa,LNU,NAP,Napa,Atlas,CALNU010046,10/08/2017,0-30',Single Family Residence-Single Story,...,No,Fire Resistant,Single Pane,Flat Ground,477332,Unknown,-122.269456,38.350372,0.00,Single Family Residence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31400,No Damage,Geyserville,LNU,SON,Pocket,Pocket Fire,CALNU 010104,10/09/2017,Unknown,School,...,No,Combustible,Multi Pane,Flat Ground,242254,Unknown,-122.912103,38.710373,0.00,Public Building
31402,No Damage,Larkfield-Wikiup,LNU,SON,Larkfield-Wikiup,Tubbs Fire,CALNU 010104,10/08/2017,Unknown,School,...,No,Fire Resistant,Multi Pane,Flat Ground,134581,Unknown,-122.759443,38.503141,0.00,Public Building
31404,No Damage,Larkfield-Wikiup,LNU,SON,Larkfield-Wikiup,Tubbs Fire,CALNU 010104,10/08/2017,Unknown,School,...,No,Combustible,Single Pane,Flat Ground,134581,Unknown,-122.759501,38.502051,0.00,Public Building
31540,Destroyed (>50%),Loma Rica,BTU,YUB,Loma Rica,Cascade,CANEU026269,10/8/2017,0-30',Outbuilding gt 10'X12',...,No,Fire Resistant,Single Pane,Flat Ground,2870,0,-121.367822,39.358913,849603.70,Outbuilding


In [133]:
# Consolidate alternative spellings of the same category into one label
cali_wildfire_filled_address_df_with_community_topography['EAVES'] = cali_wildfire_filled_address_df_with_community_topography['EAVES'].replace('Un-Enclosed', 'Unenclosed')
cali_wildfire_filled_address_df_with_community_topography['WINDOWPANE'] = cali_wildfire_filled_address_df_with_community_topography['WINDOWPANE'].replace('Single', 'Single Pane')

<a name="10.4"></a>
### 10.4 - Mapping VegClearan

In [344]:
# Create a mapping dictionary for replacements
mapping = {
    '0-30': "0-30'",
    "0-30'": "0-30'",
    '30-60': "30-60'",
    '60-100': "60-100'",
    "30-100'": "30-60'",    # Spans two buckets; assigned to the lower one
    '>100': ">100'",
    "100'": ">100'",
    "}100'": ">100'",
    'Unknown': 'Unknown'
}

# Apply the replacement
# cali_wildfire_filled_address_df_with_community_topography['VEGCLEARAN'] = cali_wildfire_filled_address_df_with_community_topography['VEGCLEARAN'].replace(mapping)
cali_wildfire_filled_address_df_with_community_topography.loc[:, 'VEGCLEARAN'] = cali_wildfire_filled_address_df_with_community_topography['VEGCLEARAN'].replace(mapping)
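
Because `replace` leaves any value that is not a key in the dictionary unchanged, it can help to confirm that only the expected labels remain afterwards. The sketch below uses a shortened copy of the mapping and toy data. `expected_labels` and the value `'10-20'` are assumptions for illustration.

```python
import pandas as pd

# Shortened copy of the mapping above
mapping = {
    '0-30': "0-30'",
    '30-60': "30-60'",
    '60-100': "60-100'",
    '>100': ">100'",
    "100'": ">100'",
}
expected_labels = {"0-30'", "30-60'", "60-100'", ">100'", 'Unknown'}

# Toy data; '10-20' is a made-up value that the mapping does not cover
toy_veg = pd.Series(['0-30', "0-30'", '>100', 'Unknown', '10-20'])
mapped_veg = toy_veg.replace(mapping)

# Anything left outside the expected label set still needs a mapping entry
unexpected = sorted(set(mapped_veg.unique()) - expected_labels)
print(unexpected)
```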


<a name="10.5"></a>
### 10.5 - Dropping Rows Where INCIDENTST is Unknown

In [263]:
cali_wildfire_filled_address_df_with_community_topography = cali_wildfire_filled_address_df_with_community_topography[~(cali_wildfire_filled_address_df_with_community_topography['INCIDENTST'] == 'Unknown')]

<a name="10.6"></a>
### 10.6 - Saving the File

In [265]:
import os

csv_folder = 'csv_data'
os.makedirs(csv_folder, exist_ok=True)  # create the folder only if it is missing

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_with_community_topography_cleaned.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df_with_community_topography.to_csv(csv_file_path, index=False)

In [308]:
cali_wildfire_filled_address_df_with_community_topography['STRUCTURET'].unique()

array(['Mobile Home Double Wide', 'Single Family Residence Single Story',
       'UtilityMiscStructure', 'Mobile Home Single Wide',
       'Commercial Building Single Story',
       'Multi Family Residence Single Story', 'Mobile Home Triple Wide',
       'School', 'Single Family Residence Multi Story', 'Motor Home',
       'Mixed Commercial/Residential', 'Commercial Building Multi Story',
       'Church', 'Multi Family Residence Multi Story', 'Infrastructure',
       'Single Family Residence-Single Story',
       'Single Family Residence-Multi Story', "Outbuilding gt 10'X12'",
       'Non-habitable-Detached Garage',
       'Multi Family Residence - Multi Story',
       'Multi Family Residence - Single Story', 'Non-habitable-Shop',
       'Commercial Building - Multi Story', 'Non-habitable-Barn',
       'Commercial Building - Single Story', 'Miscellaneous',
       'Mobile Home - Single Wide', 'Mobile Home - Motor Home',
       'Mobile Home - Double Wide', 'Mobile Home - Triple Wide',
  

In [317]:
cali_wildfire_filled_address_df_with_community_topography['STRUCTURET'].isna().sum()

0

<a name="10.7"></a>
### 10.7 - Mapping, Standardizing and Consolidating Housing Structure Type (New Column - STRUCTURET_STANDARDIZED)

In [321]:
# Identify unique values in `STRUCTURET` column and standardize them for consistency
structure_mapping = {
    'Mobile Home Double Wide': 'Mobile Home',
    'Mobile Home Single Wide': 'Mobile Home',
    'Mobile Home Triple Wide': 'Mobile Home',
    'Mobile Home - Single Wide': 'Mobile Home',
    'Mobile Home - Double Wide': 'Mobile Home',
    'Mobile Home - Triple Wide': 'Mobile Home',
    'Mobile Home - Motor Home': 'Mobile Home',
    'Motor Home': 'Mobile Home',
    'Single Family Residence Single Story': 'Single Family Residence',
    'Single Family Residence Multi Story': 'Single Family Residence',
    'Single Family Residence-Single Story': 'Single Family Residence',
    'Single Family Residence-Multi Story': 'Single Family Residence',
    'Multi Family Residence Single Story': 'Multi Family Residence',
    'Multi Family Residence Multi Story': 'Multi Family Residence',
    'Multi Family Residence - Single Story': 'Multi Family Residence',
    'Multi Family Residence - Multi Story': 'Multi Family Residence',
    'Commercial Building Single Story': 'Commercial Building',
    'Commercial Building Multi Story': 'Commercial Building',
    'Commercial Building - Single Story': 'Commercial Building',
    'Commercial Building - Multi Story': 'Commercial Building',
    'Non-habitable-Detached Garage': 'Non-habitable',
    'Non-habitable-Shop': 'Non-habitable',
    'Non-habitable-Barn': 'Non-habitable',
    "Outbuilding gt 10'X12'": 'Outbuilding',
    'Mixed Commercial/Residential': 'Mixed Use',
    'Miscellaneous': 'Other',
    'Infrastructure': 'Other',
    'UtilityMiscStructure': 'Other',
    'Church': 'Public Building',
    'School': 'Public Building',
    'Hospital': 'Public Building'
}

# Apply the mapping to standardize `STRUCTURET`
cali_wildfire_filled_address_df_with_community_topography.loc[:,'STRUCTURET_STANDARDIZED'] = cali_wildfire_filled_address_df_with_community_topography['STRUCTURET'].replace(structure_mapping)

# Verify the unique values in the updated column
standardized_unique_values = cali_wildfire_filled_address_df_with_community_topography['STRUCTURET_STANDARDIZED'].unique()

standardized_unique_values


array(['Mobile Home', 'Single Family Residence', 'Other',
       'Commercial Building', 'Multi Family Residence', 'Public Building',
       'Mixed Use', 'Outbuilding', 'Non-habitable'], dtype=object)
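
A raw structure type that is missing from the mapping would pass through `replace` unchanged and show up as its own category. One possible way to spot such values is `Series.map`, which returns `NaN` for keys that are not in the dictionary. The sketch below uses a small subset of the mapping. `'Winery'` is a made-up value, not one observed in the dataset.

```python
import pandas as pd

# Small subset of the mapping above, for illustration
structure_mapping_subset = {
    'Church': 'Public Building',
    'School': 'Public Building',
    'Motor Home': 'Mobile Home',
}

# Toy data; 'Winery' is a made-up value with no mapping entry
toy_structures = pd.Series(['Church', 'Motor Home', 'Winery'])

# map() yields NaN for values missing from the dict, unlike replace()
mapped = toy_structures.map(structure_mapping_subset)
unmapped_values = toy_structures[mapped.isna()].unique().tolist()
print(unmapped_values)
```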

<a name="11"></a>
## 11 - Saving the Final Cleaned Version File

In [324]:
import os

csv_folder = 'csv_data'
os.makedirs(csv_folder, exist_ok=True)  # create the folder only if it is missing

# Define the path for the CSV file
csv_file_path = os.path.join(csv_folder, 'cali_wildfire_filled_address_df_with_community_topography_cleaned.csv')

# Write the DataFrame to a CSV file
cali_wildfire_filled_address_df_with_community_topography.to_csv(csv_file_path, index=False)

In [272]:
# Select specific columns to analyze for 'Unknown' values
columns_to_analyze = ['VEGCLEARAN', 'STRUCTURET', 'EAVES', 'VENTSCREEN', 'EXTERIORSI', 'ROOFCONSTR','WINDOWPANE', 'YEARBUILT']

# Count 'Unknown' values in the selected columns
unknown_counts = cali_wildfire_filled_address_df_with_community_topography[columns_to_analyze].apply(lambda col: (col == 'Unknown').sum())


# Create a new DataFrame to display the selected columns and their 'Unknown' value counts
unknown_counts_df = unknown_counts.reset_index()

unknown_counts_df.columns = ['Column', 'Unknown Value Count']
# unknown_unique_counts_df.columns = ['Column', 'Unique Values']


# Display the resulting DataFrame
print(unknown_counts_df)

       Column  Unknown Value Count
0  VEGCLEARAN                 3401
1  STRUCTURET                    0
2       EAVES                13885
3  VENTSCREEN                 9420
4  EXTERIORSI                 1870
5  ROOFCONSTR                 1752
6  WINDOWPANE                 6787
7   YEARBUILT                 7294


In [266]:
df = cali_wildfire_filled_address_df_with_community_topography.copy()