# <center><font color='#D0533E'><b>Startup Genome</b></font> - <font color='#0D0D2F'><b>Group 3</b></font></center>

<center><font color='#02231c'><b>Module 1</b></font></center>
<p><center>
<table>
  <tr>
    <th><b>Name</b></th>
    <th><b>Contact</b></th>
  </tr>
  <tr>
    <td>Fulano1</td>
    <td><a href="https://www.linkedin.com/in/leo-koki-shashiki/">Linkedin</a></td>
  </tr>
  <tr>
    <td>Rafael Costa</td>
    <td><a href="https://www.linkedin.com/in/rafael-costa-a642752b/">Linkedin</a></td>
  </tr>
  <tr>
    <td>Thiago Seronni Mendonça</td>
    <td><a href="https://www.linkedin.com/in/thiagoseronni/">Linkedin</a></td>
  </tr>
  <tr>
    <td>Leo Koki Shashiki</td>
    <td><a href="https://www.linkedin.com/in/leo-koki-shashiki/">Linkedin</a></td>
  </tr>
</table>
</center></p>

Startup Genome’s Global Startup Ecosystem Report (GSER) is powered by the world’s most comprehensive and quality-controlled dataset on startup ecosystems. Drawing on data from 3.5 million startups across 290 global ecosystems, our data and insights are the product of over a decade of independent research and policy work.

GSER 2023 ranks the top 30 and 10 runner-up global ecosystems, and includes a top 100 ranking of emerging ecosystems. It also takes a look at startup communities from a regional perspective, separately ranking ecosystems in Africa, Asia, Europe, Latin America, MENA, North America, and Oceania.

References: https://www.ibm.com/docs/en/spss-modeler/saas?topic=data-writing-description-report

Miro:   https://miro.com/app/board/uXjVMC_icTM=/ \
Trello: https://trello.com/b/bjkRHTjw/sirius-proj-1 \
GitHub: https://github.com/joaomorossini/startup_genome

# Libraries Used

In [1]:
import pandas as pd

In [2]:
from src.utils.UsefulPaths import Paths
paths = Paths()

# Importing Datasets

In [3]:
df_abstract = pd.read_parquet(paths.raw_parquet_abstract)
df_abstract.head(2)

In [None]:
df_ipc = pd.read_excel(paths.raw_ipc_titles)
df_ipc.head(2)

In [None]:
df_list_of_companies = pd.read_csv(paths.raw_list_of_companies)
df_list_of_companies.head(2)

In [None]:
df_raw_patents = pd.read_parquet(paths.raw_parquet_raw_patents)
df_raw_patents.head(2)

In [None]:
# The file is read with header=None, so the column names are supplied explicitly
df_table_for_applicants = pd.read_csv(paths.raw_table_for_applicants, header=None,
                                      names=['patent_id', 'owner', 'city_region', 'country', 'year'])
df_table_for_applicants.head(10)

# Data Understanding

## 1 - What is the format of the data?

The data was delivered as CSV and XLSX files. Files larger than 1 GB were converted to Parquet, which improved read performance and reduced storage usage.
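
A conversion like the one described above could be done along these lines. This is a minimal sketch, not the group's actual script: the helper name `csv_to_parquet` is hypothetical, and `DataFrame.to_parquet` needs a Parquet engine such as `pyarrow` installed alongside pandas.

```python
import pandas as pd


def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Read a CSV file with pandas and write it back out as Parquet.

    Hypothetical helper for illustration only. It loads the whole file
    into memory, so very large files need enough RAM.
    """
    df = pd.read_csv(csv_path)
    # index=False avoids storing the default RangeIndex as an extra column
    df.to_parquet(parquet_path, index=False)
```

It would be called as `csv_to_parquet('input.csv', 'output.parquet')`, where both file names are placeholders.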

## 2 - Identify the method used to capture the data (for example, ODBC)

The data was provided as CSV and XLSX files, which we read with the pandas library.


## 2 - How large is the database (in numbers of rows and columns)?

In [None]:
print(f'Dataset abstract.parquet         |  Rows: {df_abstract.shape[0]}\t Columns: {df_abstract.shape[1]}')
print(f'Dataset IPC Titles.xlsx          |  Rows: {df_ipc.shape[0]}\t\t Columns: {df_ipc.shape[1]}')
print(f'Dataset ListOfCompanies.csv      |  Rows: {df_list_of_companies.shape[0]}\t Columns: {df_list_of_companies.shape[1]}')
print(f'Dataset raw_patents.parquet      |  Rows: {df_raw_patents.shape[0]}\t Columns: {df_raw_patents.shape[1]}')
print(f'Dataset table_for_applicants.csv |  Rows: {df_table_for_applicants.shape[0]}\t Columns: {df_table_for_applicants.shape[1]}')
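
The same size information can also be collected into one table, which is easier to sort or export than printed lines. The sketch below uses toy DataFrames in place of the real datasets. The helper name `summarize_shapes` is an assumption made for illustration.

```python
import pandas as pd


def summarize_shapes(frames: dict) -> pd.DataFrame:
    """Return one row per dataset with its number of rows and columns."""
    records = [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in frames.items()
    ]
    return pd.DataFrame(records, columns=["dataset", "rows", "columns"])


# Toy frames standing in for the real datasets
example_frames = {
    "small": pd.DataFrame({"a": [1, 2, 3]}),
    "wide": pd.DataFrame({"a": [1], "b": [2]}),
}
shape_summary = summarize_shapes(example_frames)
print(shape_summary)
```

With the notebook's own objects, the dictionary could map names such as `'Abstract'` to `df_abstract`, and so on.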

## 3 - Data Quality

In [None]:
list_df = [df_abstract, df_ipc, df_list_of_companies, df_raw_patents, df_table_for_applicants]
dict_df = {
    'abst': 'Abstract',
    'ipc': 'IPC Titles',
    'comp': 'List of Companies',
    'pat': 'Raw Patents',
    'app': 'Table for Applicants'
}

### Null Analysis

In [None]:
null_list = []
for df, values in zip(list_df, dict_df.values()):
    df_null_percentage = pd.DataFrame(df.isnull().sum()* 100 / len(df)).reset_index()
    df_null_percentage.rename(columns={'index': values, 0:'null_percentage'}, inplace = True)

    df_null_percentage.null_percentage = df_null_percentage.null_percentage.round(1)
    null_list.append(df_null_percentage)
    print('#'*100 + '\n')
    print(values+ '\n')
    display(df_null_percentage.sort_values(by = 'null_percentage', ascending=False))
    print('_'*100 + '\n')
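
The per-column null percentage computed in the loop above can be written more compactly, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on toy data follows. The helper name `null_percentage` and the sample values are illustrative assumptions, not results from the real datasets.

```python
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of null values per column, rounded to one decimal, highest first."""
    # isnull().mean() gives the fraction of nulls in each column
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Toy data: half of the 'owner' values are missing, 'year' is complete
toy = pd.DataFrame({
    "owner": ["A", None, None, "B"],
    "year": [2001, 2002, 2003, 2004],
})
null_pct = null_percentage(toy)
print(null_pct)
```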

### Duplicate Analysis

In [None]:
duplicated_list = []
for df, values in zip(list_df, dict_df.values()):
    # rename_axis plus reset_index(name=...) yields the same column names across pandas versions
    df_duplicated = df.duplicated().value_counts().rename_axis('duplicated').reset_index(name=values)
    duplicated_list.append(df_duplicated)

    print('#'*100 + '\n')
    print(values+ '\n')
    display(df_duplicated)
    print('_'*100 + '\n')
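
The loop above flags only rows that are identical in every column. A repeated identifier with different values in other columns passes that check unnoticed, so it can be worth checking duplicates on a key column as well. The sketch below assumes `patent_id` is such a key. That is an assumption to verify against the data, and the helper name `duplicate_summary` is hypothetical.

```python
import pandas as pd


def duplicate_summary(df: pd.DataFrame, subset=None) -> dict:
    """Count duplicated rows, optionally judged only on the columns in `subset`."""
    # keep='first' (the default) flags every repeat after the first occurrence
    n_duplicated = int(df.duplicated(subset=subset).sum())
    return {"rows": len(df), "duplicated": n_duplicated}


# Toy data: 'P1' appears twice, but with different owners
toy = pd.DataFrame({
    "patent_id": ["P1", "P1", "P2"],
    "owner": ["A", "B", "C"],
})
full_row = duplicate_summary(toy)                     # compares whole rows
by_key = duplicate_summary(toy, subset=["patent_id"])  # compares the key only
print(full_row, by_key)
```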

## 4 - Does the data include characteristics relevant to the business question?

At first glance, the datasets appear to contain all the data needed to answer the business questions.


## 5 - What data types are present (symbolic, numeric, etc.)?

In [None]:
list_dtypes = []
for df, values in zip(list_df, dict_df.values()):

    df_dtypes = pd.DataFrame(df.dtypes).reset_index()
    df_dtypes.rename(columns={'index': values, 0:'data_type'}, inplace=True )
    list_dtypes.append(df_dtypes)

    print('#'*100 + '\n')
    print(values+ '\n')
    display(df_dtypes)
    print('_'*100 + '\n')
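
To answer the symbolic versus numeric question directly, each column can be labelled with `pandas.api.types.is_numeric_dtype`. This is a rough sketch on toy data. The helper name `classify_columns` is hypothetical, and note that pandas treats boolean columns as numeric under this check.

```python
import pandas as pd


def classify_columns(df: pd.DataFrame) -> dict:
    """Label each column as 'numeric' or 'symbolic' based on its pandas dtype."""
    return {
        col: "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "symbolic"
        for col in df.columns
    }


# Toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "patent_id": ["P1", "P2"],
    "year": [2001, 2002],
    "score": [0.5, 0.7],
})
column_kinds = classify_columns(toy)
print(column_kinds)
```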

In [None]:
unique_patents_abstract = len(df_abstract['publication_number'].unique())
print(f'Unique patents for abstract: {unique_patents_abstract}')

unique_patents_raw_patents = len(df_raw_patents['patent_id'].unique())
print(f'Unique patents for raw_patents: {unique_patents_raw_patents}')

unique_patents_applicants = len(df_table_for_applicants['patent_id'].unique())
print(f'Unique patents for applicants: {unique_patents_applicants}')

In [None]:
# df_abstract[~df_abstract['publication_number'].isin(df_raw_patents['patent_id'])]
# df_abstract
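
The commented-out cell above points at a useful check: which identifiers in one dataset have no match in another. A minimal sketch on toy Series is shown below. The helper name `ids_missing_from` is hypothetical, and it assumes that `publication_number` and `patent_id` use the same identifier format, which should be confirmed before relying on the result.

```python
import pandas as pd


def ids_missing_from(left: pd.Series, right: pd.Series) -> pd.Series:
    """Unique non-null values of `left` that never appear in `right`."""
    left_unique = left.dropna().drop_duplicates()
    return left_unique[~left_unique.isin(right)]


# Toy identifiers standing in for df_abstract['publication_number']
# and df_raw_patents['patent_id']
abstract_ids = pd.Series(["P1", "P2", "P3", "P3"])
raw_patent_ids = pd.Series(["P2", "P3", "P4"])
missing = ids_missing_from(abstract_ids, raw_patent_ids)
print(missing.tolist())
```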

## 6 - Did you compute basic statistics for the key attributes? What insight did this provide into the business question?
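
One possible starting point for this question is `describe()` for numeric attributes and `value_counts()` for categorical ones. The sketch below runs on toy data only. The column names `country` and `year` come from the applicants table, while the values are invented for illustration and are not results.

```python
import pandas as pd

# Toy rows standing in for df_table_for_applicants
toy = pd.DataFrame({
    "country": ["BR", "US", "US"],
    "year": [2001, 2003, 2005],
})

# Count, mean, standard deviation, min, quartiles and max for a numeric column
numeric_stats = toy["year"].describe()

# Frequency of each category, most common first
top_countries = toy["country"].value_counts()

print(numeric_stats)
print(top_countries)
```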

## 7 - Are you able to prioritize relevant attributes? If not, are business analysts available to provide further insight?

## 8 - What sort of hypotheses have you formed about the data?

## 9 - Which attributes seem promising for further analysis?

## 10 - Have your explorations revealed new characteristics about the data?

## 11 - How have these explorations changed your initial hypothesis?

## 12 - Can you identify particular subsets of data for later use?

## 13 - Take another look at your data mining goals. Has this exploration altered the goals?