## 1. OBJECTIVE
Classify patents into ecosystems based on their geographical origin (latitude and longitude).

<br>

### METRICS
To be defined

## 2. DATA

### RAW DATA
- Subsector definitions: JSON file containing definitions and keywords
- Patent abstracts: parquet file containing patent abstracts
- Raw patents: parquet file containing general patent data

### PROCESSED DATA
- raw&abstract_2018-2020.parquet
- coordinates.parquet

In [3]:
import pandas as pd
df_coordinates = pd.read_parquet('../data/processed/coordinates.parquet')
df_coordinates.head(2)

| Unnamed: 0 | Ecosystem | latitude | longitude | coordinates |
|---|---|---|---|---|
| 0 | | 36.06488 | 120.38042 | 36.06488,120.38042 |
| 1 | Shenzhen | 22.54554 | 114.0683 | 22.54554,114.0683 |


## 3. DATA SPLITTING
The data from `raw&abstract_2018-2020.parquet` was split into:
- 70%: training_data.parquet
- 30%: test_data.parquet
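
A minimal sketch of how a 70/30 split like this could be produced. It assumes a purely random, row-wise split with pandas; the toy DataFrame, its column names, and the `random_state` value are illustrative assumptions, not the settings actually used for this project.

```python
import pandas as pd

# Toy stand-in for raw&abstract_2018-2020.parquet (hypothetical columns).
df = pd.DataFrame({
    "patent_id": range(10),
    "abstract": [f"text {i}" for i in range(10)],
})

# Draw 70% of the rows at random for training; the remaining rows form the test set.
train_df = df.sample(frac=0.7, random_state=42)
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))

# The two parts could then be written out, for example:
# train_df.to_parquet("training_data.parquet")
# test_df.to_parquet("test_data.parquet")
```

Dropping the sampled index from the original frame guarantees that no row ends up in both files.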

## 4. DATA EXPLORATION
See notebooks:
- 01_data_preparation_abstracts.ipynb
- 01_data_preparation_raw_patents.ipynb
- 01_data_preparation_merge_raw&abstract.ipynb

## 5. ALGORITHMS
- DBSCAN (haversine distance on latitude/longitude) for clustering the patents into ecosystems

In [5]:
from sklearn.cluster import DBSCAN
import numpy as np

In [9]:
# Convert latitude and longitude from degrees to radians (required by the haversine metric)
df_coordinates['latitude_rad'] = np.radians(df_coordinates['latitude'])
df_coordinates['longitude_rad'] = np.radians(df_coordinates['longitude'])

In [10]:
# Count rows with missing coordinates (NaN values survive the radian conversion)
print(df_coordinates[['latitude_rad', 'longitude_rad']].isna().sum())

latitude_rad     1
longitude_rad    1
dtype: int64


In [11]:
# Drop rows with missing coordinates before clustering
df_coordinates.dropna(subset=['latitude_rad', 'longitude_rad'], inplace=True)

In [12]:
# Set the DBSCAN neighborhood radius: 100 km expressed in radians
kms_per_radian = 6371.0088  # mean Earth radius in km
epsilon = 100 / kms_per_radian
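
To make the unit conversion concrete, here is a small sketch of the haversine distance in plain Python, applied to the two coordinate pairs shown in the `df_coordinates.head(2)` output. The helper `haversine_km` is a hypothetical name introduced for illustration; scikit-learn computes the same quantity internally, in radians on the unit sphere.

```python
from math import asin, cos, radians, sin, sqrt

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, same constant as above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2)
    return 2 * asin(sqrt(a)) * KMS_PER_RADIAN


epsilon = 100 / KMS_PER_RADIAN  # 100 km expressed in radians

# Rows 0 and 1 of df_coordinates (the second one is labeled Shenzhen).
d = haversine_km(36.06488, 120.38042, 22.54554, 114.0683)
within_eps = d / KMS_PER_RADIAN <= epsilon

print(f"epsilon = {epsilon:.5f} rad")
print(f"distance = {d:.0f} km, within epsilon: {within_eps}")
```

Two points can be direct neighbors in DBSCAN only if their distance in radians is at most `epsilon`, which is why both coordinates and radius are converted to radians.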

In [13]:
# The haversine metric expects [latitude, longitude] pairs in radians
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
df_coordinates['cluster'] = db.fit_predict(df_coordinates[['latitude_rad', 'longitude_rad']])

In [15]:
# Count the number of unique clusters, excluding noise (cluster ID -1)
num_clusters = df_coordinates.loc[df_coordinates['cluster'] != -1, 'cluster'].nunique()

# Count again, treating noise (if any) as a separate "cluster".
# With min_samples=1 no point is labeled as noise, so both counts are equal.
num_clusters_including_noise = df_coordinates['cluster'].nunique()

print(f"Number of clusters excluding noise: {num_clusters}")
print(f"Number of clusters including noise: {num_clusters_including_noise}")

Number of clusters excluding noise: 659
Number of clusters including noise: 659
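
The two counts are identical because of the `min_samples=1` setting: a point counts toward its own neighborhood, so every point is a core point and none is labeled `-1`. The sketch below illustrates this on three made-up coordinates (assumed values, not taken from the dataset) and shows how the labels change with `min_samples=2`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy (latitude, longitude) pairs in degrees: two nearby points and one far away.
points_deg = np.array([[48.14, 11.58], [48.20, 11.60], [35.68, 139.69]])
points_rad = np.radians(points_deg)
eps = 100 / 6371.0088  # 100 km in radians

results = {}
for min_samples in (1, 2):
    model = DBSCAN(eps=eps, min_samples=min_samples,
                   algorithm="ball_tree", metric="haversine")
    labels = model.fit_predict(points_rad)
    results[min_samples] = labels.tolist()
    n_noise = int((labels == -1).sum())
    print(f"min_samples={min_samples}: labels={labels.tolist()}, noise points={n_noise}")
```

With `min_samples=1` the isolated point forms a singleton cluster; with `min_samples=2` it is reported as noise instead. Which behavior is preferable for isolated patents is a design choice that this document leaves open.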


In [17]:
unique_values = df_coordinates['Ecosystem'].unique()
print("Unique values in 'Ecosystem' column:", unique_values)

Unique values in 'Ecosystem' column: [None 'Shenzhen' 'Frankfurt' 'Osaka' 'Wuxi' 'Mumbai' 'Leipzig' 'Beijing'
 'Los Angeles' 'Dallas' 'Amsterdam-Delta' 'Columbus' 'Chicago' 'Montreal'
 'Hanover' 'Philadelphia' 'Minneapolis' 'Houston'
 'villach-austria NOT resolved' 'Toronto' 'Tokyo' 'Seattle' 'Zurich'
 'Silicon Valley' 'London' 'Detroit' 'Tel Aviv' 'Boston' 'Salt Lake-Provo'
 'Philadelphia;New York City' 'Indiana;West Lafayette' 'Research Triangle'
 'Stockholm;Stockholms län' 'San Bernardino' 'Metro Rhein-Ruhr'
 'Jacksonville' 'Orlando' 'New York City' 'Seoul' 'San Diego' 'Reno'
 'Kansas City;Missouri' 'Washington DC' 'Richmond' 'Tulsa' 'Atlanta'
 'Mannheim-Heidelberg' 'Calgary' 'Singapore' 'Stuttgart' 'Sydney' 'Galway'
 'Paris' 'Nanjing' 'Portland' 'richmond-united states NOT resolved'
 'Bavaria;Munich' 'Waterloo' 'Ottawa' 'Lyon'
 'lafayette-united states NOT resolved' 'Hyderabad' 'Denver-Boulder'
 'Bristol' 'St. Louis' 'Hamburg State;Hamburg'
 'Indiana Center;West Lafayette;Indiana' 

In [18]:
value_counts = df_coordinates['Ecosystem'].value_counts()
print("Count of each unique value in 'Ecosystem' column:")
print(value_counts)

Count of each unique value in 'Ecosystem' column:
Ecosystem
New York City                      458
Stuttgart                          436
Metro Rhein-Ruhr                   382
Bavaria;Munich                     335
Zurich                             302
                                  ... 
salem-germany NOT resolved           1
annecy-france NOT resolved           1
Skopje                               1
lugano-switzerland NOT resolved      1
merdingen-germany NOT resolved       1
Name: count, Length: 650, dtype: int64
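
The `Ecosystem` labels above mix single names, `;`-separated names, and entries ending in `NOT resolved`. Before comparing them with the DBSCAN clusters they may need normalizing. The sketch below is one possible approach under assumed conventions: that `;` separates alternative ecosystem names and that the `NOT resolved` suffix marks a location that could not be matched. The function name `parse_ecosystem` is hypothetical.

```python
def parse_ecosystem(label):
    """Return (names, resolved) for a raw Ecosystem label.

    Assumed conventions (not confirmed by the data source):
    - None means no ecosystem was assigned
    - a trailing ' NOT resolved' marks an unmatched location
    - ';' separates several candidate ecosystem names
    """
    if label is None or label.endswith(" NOT resolved"):
        return [], False
    return [part.strip() for part in label.split(";")], True


examples = [None, "Shenzhen", "Bavaria;Munich", "villach-austria NOT resolved"]
for raw in examples:
    print(repr(raw), "->", parse_ecosystem(raw))
```

Applied to the column, for example with `df_coordinates['Ecosystem'].map(parse_ecosystem)`, this would separate resolved from unresolved labels. Missing values stored as NaN rather than `None` would need an extra check.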
