# 🧪 Inspect PatentsView File: `patent_assignee.tsv`
This notebook loads the first few rows of `patent_assignee.tsv` and inspects the file's structure to confirm its column names, inferred dtypes, and sample values.

In [2]:
import pandas as pd
import os
# Work from the project root so the relative path below resolves
os.chdir("/Users/soheilkhodadadi/Documents/Projects/semantic-patterns")
# Path to the file to inspect
file_path = "data/raw/patents/patentsview/patent_assignee.tsv"

# Read the first few rows
df = pd.read_csv(file_path, sep="\t", nrows=5)
print("✅ Columns in file:")
print(df.columns.tolist())
df.head()

✅ Columns in file:
['patent_id', 'assignee_sequence', 'assignee_id', 'disambig_assignee_individual_name_first', 'disambig_assignee_individual_name_last', 'disambig_assignee_organization', 'assignee_type', 'location_id']


| | patent_id | assignee_sequence | assignee_id | disambig_assignee_individual_name_first | disambig_assignee_individual_name_last | disambig_assignee_organization | assignee_type | location_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 4488683 | 0 | b12aba35-6fdd-4346-b7c0-8c7a157c8844 | NaN | NaN | Metal Works Ramat David | 3 | 50dc5d46-16c8-11ed-9b5f-1234bde3cd05 |
| 1 | 11872626 | 0 | ee0744f6-d2d4-46f4-8be7-2758e096a6a9 | NaN | NaN | DIVERGENT TECHNOLOGIES, INC. | 2 | 15c69712-16c8-11ed-9b5f-1234bde3cd05 |
| 2 | 5856666 | 0 | 79f3a1f8-bf1c-41d5-ba18-4f8e7041ebe9 | NaN | NaN | U.S. Philips Corporation | 2 | 92237ca2-16c8-11ed-9b5f-1234bde3cd05 |
| 3 | 5204210 | 0 | a950758a-0188-4b59-8d8f-2c1a23d0d201 | NaN | NaN | Xerox Corporation | 2 | 0cd1998f-16c8-11ed-9b5f-1234bde3cd05 |
| 4 | 5302149 | 1 | 75d7db10-e04f-434c-afb4-c72688ea244b | NaN | NaN | COMMONWEALTH SCIENTIFIC AND INDUSTRIAL RESEARC... | 7 | 4d36742f-16c8-11ed-9b5f-1234bde3cd05 |


## 📊 Column Overview and Null Check

In [3]:
# View column types and null counts
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   patent_id                                5 non-null      int64  
 1   assignee_sequence                        5 non-null      int64  
 2   assignee_id                              5 non-null      object 
 3   disambig_assignee_individual_name_first  0 non-null      float64
 4   disambig_assignee_individual_name_last   0 non-null      float64
 5   disambig_assignee_organization           5 non-null      object 
 6   assignee_type                            5 non-null      int64  
 7   location_id                              5 non-null      object 
dtypes: float64(2), int64(3), object(3)
memory usage: 448.0+ bytes


patent_id                                  0
assignee_sequence                          0
assignee_id                                0
disambig_assignee_individual_name_first    5
disambig_assignee_individual_name_last     5
disambig_assignee_organization             0
assignee_type                              0
location_id                                0
dtype: int64
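
Note that the dtypes above are inferred from only five rows, so they may not hold for the full file: the two individual-name columns show up as `float64` simply because every sampled value is null. The following is a minimal sketch of one way to guard against this, assuming identifiers and names should be read as text. The in-memory `sample_tsv` (two rows copied from the preview) and the `text_cols` list are illustrative stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for patent_assignee.tsv (two rows copied from the preview).
sample_tsv = (
    "patent_id\tassignee_sequence\tassignee_id\t"
    "disambig_assignee_individual_name_first\t"
    "disambig_assignee_individual_name_last\t"
    "disambig_assignee_organization\tassignee_type\tlocation_id\n"
    "4488683\t0\tb12aba35-6fdd-4346-b7c0-8c7a157c8844\t\t\t"
    "Metal Works Ramat David\t3\t50dc5d46-16c8-11ed-9b5f-1234bde3cd05\n"
    "5204210\t0\ta950758a-0188-4b59-8d8f-2c1a23d0d201\t\t\t"
    "Xerox Corporation\t2\t0cd1998f-16c8-11ed-9b5f-1234bde3cd05\n"
)

# Assumption: these columns hold identifiers or names, so text is the safer dtype.
text_cols = [
    "patent_id",
    "assignee_id",
    "disambig_assignee_individual_name_first",
    "disambig_assignee_individual_name_last",
    "disambig_assignee_organization",
    "location_id",
]

# Passing dtype explicitly stops pandas from guessing float64 for all-null columns.
typed = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={col: str for col in text_cols},
)

print(typed.dtypes)
print(typed.isna().sum())
```

With the real file, the same `dtype` mapping can be passed alongside `file_path` and a larger `nrows` to check whether the name columns are ever populated.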

## 🔍 Preview Sample Values by Column

In [4]:
# Print sample values for each column
for col in df.columns:
    print(f"\n🔍 Sample values for column: {col}")
    print(df[col].dropna().unique()[:5])


🔍 Sample values for column: patent_id
[ 4488683 11872626  5856666  5204210  5302149]

🔍 Sample values for column: assignee_sequence
[0 1]

🔍 Sample values for column: assignee_id
['b12aba35-6fdd-4346-b7c0-8c7a157c8844'
 'ee0744f6-d2d4-46f4-8be7-2758e096a6a9'
 '79f3a1f8-bf1c-41d5-ba18-4f8e7041ebe9'
 'a950758a-0188-4b59-8d8f-2c1a23d0d201'
 '75d7db10-e04f-434c-afb4-c72688ea244b']

🔍 Sample values for column: disambig_assignee_individual_name_first
[]

🔍 Sample values for column: disambig_assignee_individual_name_last
[]

🔍 Sample values for column: disambig_assignee_organization
['Metal Works Ramat David' 'DIVERGENT TECHNOLOGIES, INC.'
 'U.S. Philips Corporation' 'Xerox Corporation'
 'COMMONWEALTH SCIENTIFIC AND INDUSTRIAL RESEARCH ORGANISATION']

🔍 Sample values for column: assignee_type
[3 2 7]

🔍 Sample values for column: location_id
['50dc5d46-16c8-11ed-9b5f-1234bde3cd05'
 '15c69712-16c8-11ed-9b5f-1234bde3cd05'
 '92237ca2-16c8-11ed-9b5f-1234bde3cd05'
 '0cd1998f-16c8-11ed-9b5f-1234bde

## ✅ Next Steps
Use the output of this notebook to:
- Confirm column names
- Choose the correct ones for `usecols` in your script
- Optionally export cleaned sample rows
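
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `keep_cols` is a hypothetical selection (your script may need different columns), and `sample_tsv` is an in-memory stand-in built from three preview rows so the snippet runs without the real file.

```python
import io
import pandas as pd

# In-memory stand-in for patent_assignee.tsv (three rows copied from the preview).
sample_tsv = (
    "patent_id\tassignee_sequence\tassignee_id\t"
    "disambig_assignee_individual_name_first\t"
    "disambig_assignee_individual_name_last\t"
    "disambig_assignee_organization\tassignee_type\tlocation_id\n"
    "4488683\t0\tb12aba35-6fdd-4346-b7c0-8c7a157c8844\t\t\t"
    "Metal Works Ramat David\t3\t50dc5d46-16c8-11ed-9b5f-1234bde3cd05\n"
    "11872626\t0\tee0744f6-d2d4-46f4-8be7-2758e096a6a9\t\t\t"
    "DIVERGENT TECHNOLOGIES, INC.\t2\t15c69712-16c8-11ed-9b5f-1234bde3cd05\n"
    "5856666\t0\t79f3a1f8-bf1c-41d5-ba18-4f8e7041ebe9\t\t\t"
    "U.S. Philips Corporation\t2\t92237ca2-16c8-11ed-9b5f-1234bde3cd05\n"
)

# Hypothetical column selection; adjust to what the downstream script needs.
keep_cols = [
    "patent_id",
    "assignee_id",
    "disambig_assignee_organization",
    "assignee_type",
]

# 1. Confirm the column names by reading only the header row.
header = pd.read_csv(io.StringIO(sample_tsv), sep="\t", nrows=0).columns
missing = [col for col in keep_cols if col not in header]
if missing:
    raise KeyError(f"Columns not found in file: {missing}")

# 2. Load only the chosen columns, in chunks to keep memory use bounded.
chunks = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=keep_cols,
    dtype={"patent_id": str},  # assumption: treat patent IDs as text
    chunksize=2,
)
subset = pd.concat(chunks, ignore_index=True)

# 3. Optionally export a few rows as tab-separated text.
exported = subset.head().to_csv(sep="\t", index=False)

print(subset.shape)
print(exported)
```

For the real file, replace `io.StringIO(sample_tsv)` with `file_path` and raise `chunksize` to something far larger; the tiny value here only demonstrates that chunked reading works.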