# **Indian Startup Funding** [![Static Badge](https://img.shields.io/badge/Open%20in%20Colab%20-%20orange?style=plastic&logo=googlecolab&labelColor=grey)](https://colab.research.google.com/github/sshrizvi/DS-Python/blob/main/Pandas/CaseStudies/IndianStartupFunding/Notebooks/data_preprocessing.ipynb)



### 📦 **Importing Relevant Libraries**

In [1]:
import pandas as pd
import re
from rapidfuzz import fuzz, process

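If this cell fails with `ModuleNotFoundError: No module named 'rapidfuzz'`, install the package first (for example, `pip install rapidfuzz`) and re-run the cell.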

### ⚠️ **Data Warning**
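The raw dataset is read from [indian_startup_funding.csv](../Data/indian_startup_funding.csv) in the `Data` folder. The relative path assumes the notebook is run from its own directory.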

#### **Reading Data into DataFrames**

In [61]:
funding_df = pd.read_csv(
    filepath_or_buffer = '../Data/indian_startup_funding.csv'
)

### ⚠️ **Constant Warning**
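Update the constants below to reflect current conditions. `DOLLAR_RATE` is the USD to INR exchange rate (rupees per US dollar) used for the currency conversion.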

In [62]:
DOLLAR_RATE = 86.06

### **⚙️ Data Preprocessing**

In [63]:
funding_df.head()

| | Date | Startup Name | Industry | Sub-vertical | Location | Investors | Investment Type | Amount in USD | Website URL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-04-14 | Swiggy | Online Food Delivery | Online Food Delivery | Bengaluru | Amansa Holdings, Carmignac, Falcon Edge Capita... | Series J | 343000000.0 | https://www.swiggy.com/ |
| 1 | 2021-04-14 | Beldara | E-commerce | Global B2B marketplace | Mumbai | Hindustan Media Ventures | Venture | 7400000.0 | https://beldara.com/ |
| 2 | 2021-04-07 | Groww | FinTech | Investment platform | Bengaluru | MC Global Edtech, B Capital, Baron, others | Series D | 83000000.0 | https://groww.in/ |
| 3 | 2021-04-05 | Meesho | E-commerce | Online reselling platform | Bengaluru | SoftBank Vision Fund 2 | Series E | 300000000.0 | http://www.meesho.com/ |
| 4 | 2021-04-01 | BYJU’S | Edu-tech | Online tutoring | Bengaluru | Innoven Capital | Series F | 460000000.0 | http://www.byjus.com/ |


In [64]:
funding_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3212 entries, 0 to 3211
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             3210 non-null   object 
 1   Startup Name     3212 non-null   object 
 2   Industry         2276 non-null   object 
 3   Sub-vertical     3041 non-null   object 
 4   Location         3032 non-null   object 
 5   Investors        3177 non-null   object 
 6   Investment Type  3205 non-null   object 
 7   Amount in USD    2222 non-null   float64
 8   Website URL      2670 non-null   object 
dtypes: float64(1), object(8)
memory usage: 226.0+ KB


#### **Preprocessing Tasks**

1. Rename columns to something convenient.  
   
    ```text
    'Startup Name'    -> 'Startup'
    'Sub-vertical'    -> 'SubVertical'
    'Investment Type' -> 'Round'
    'Amount in USD'   -> 'AmountInCrores'
    'Website URL'     -> 'URL'
    ```

2. Convert datatype of column *Date* from `object` to `datetime`.  

    **Reason:** `datetime` gives us more flexibility when analyzing the data by month and year.

3. Convert *Amount* from *USD* to *INR*, then express it in *crores* (1 crore = 10,000,000).
4. Process *Startup* column to remove ambiguity in startup names.
5. Process *Investors* column so that you can extract all individual investors.
6. Prepare a new column *Year* for better analysis.

##### **1. Renaming Columns**

In [65]:
funding_df.rename(
    columns = {
        'Startup Name' : 'Startup',
        'Sub-vertical' : 'SubVertical',
        'Investment Type' : 'Round',
        'Amount in USD' : 'AmountInCrores',
        'Website URL' : 'URL'
    },
    inplace = True
)
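By default, `DataFrame.rename` ignores keys that are not present in the frame, so a typo in a source column name would pass unnoticed. The sketch below uses a small made-up frame (not the real dataset) to show how `errors='raise'` surfaces such a typo. The name `toy_df` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for funding_df (illustrative values only)
toy_df = pd.DataFrame({'Startup Name': ['Groww'], 'Amount in USD': [83000000.0]})

# Default behaviour: the misspelled key is ignored and nothing is renamed
unchanged = toy_df.rename(columns={'Startup name': 'Startup'})

# With errors='raise', the same typo fails loudly with a KeyError
try:
    toy_df.rename(columns={'Startup name': 'Startup'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True

# Correct keys rename as expected
renamed = toy_df.rename(
    columns={'Startup Name': 'Startup', 'Amount in USD': 'AmountInCrores'},
    errors='raise'
)
print(list(unchanged.columns), typo_detected, list(renamed.columns))
```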

##### **2. Converting Data Type of Column *Date***

In [66]:
funding_df.Date = pd.to_datetime(
    arg = funding_df.Date
)
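The `info()` output shows two missing values in *Date* (3210 non-null out of 3212 rows); `pd.to_datetime` turns those into `NaT`. If the column also contained malformed strings, the call would raise unless `errors='coerce'` is passed. A minimal sketch on made-up values, assuming that coercing bad entries to `NaT` is acceptable for the analysis:

```python
import pandas as pd

# Made-up sample: one valid date, one missing value, one malformed string
raw_dates = pd.Series(['2021-04-14', None, 'not a date'])

# errors='coerce' converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors='coerce')

# Count how many rows could not be parsed, to decide whether to inspect them
n_unparsed = int(parsed.isna().sum())
print(parsed.dtype, n_unparsed)
```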

##### **3. Converting *Amount* from USD to INR (Crores)**

In [67]:
# USD -> INR via DOLLAR_RATE, then INR -> crores (1 crore = 10,000,000 rupees)
funding_df.AmountInCrores = (funding_df.AmountInCrores * DOLLAR_RATE) / 10_000_000
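As a worked check of the conversion, the sketch below applies the same arithmetic to the Swiggy amount from the preview (343,000,000 USD). The helper `usd_to_crores` and the constant `RUPEES_PER_CRORE` are hypothetical names introduced only for illustration.

```python
DOLLAR_RATE = 86.06              # INR per USD, same value as the notebook constant
RUPEES_PER_CRORE = 10_000_000    # hypothetical constant: 1 crore = 10,000,000 rupees

def usd_to_crores(amount_usd, rate=DOLLAR_RATE):
    """Convert a USD amount to INR crores (hypothetical helper)."""
    return amount_usd * rate / RUPEES_PER_CRORE

# 343,000,000 USD * 86.06 = 29,518,580,000 INR, i.e. about 2951.858 crores
swiggy_crores = usd_to_crores(343_000_000.0)
print(round(swiggy_crores, 3))
```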

##### **4. Processing *Startup* Column**

In [68]:
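def clean_startup_name(name):
    """
    Clean and normalize a startup name: lowercase it, strip punctuation,
    drop common corporate suffixes, and collapse repeated whitespace.
    """

    name = str(name).strip().lower()
    name = re.sub(r'[^\w\s]', '', name)

    # Remove standalone corporate suffixes such as 'ltd' or 'inc'
    for suffix in ['inc', 'llc', 'ltd', 'co', 'corp']:
        name = re.sub(r'\b' + suffix + r'\b', '', name)

    # Collapse whitespace last, so gaps left by removed suffixes are closed
    name = re.sub(r'\s+', ' ', name)

    return name.strip()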
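To see what the normalization does, here is a self-contained restatement of the function applied to a few names. This version collapses whitespace after removing suffixes, an ordering chosen here so that a suffix in the middle of a name does not leave a double space. The first sample name comes from the data preview; the other two are made-up variants for illustration.

```python
import re

def clean_startup_name(name):
    """Lowercase, strip punctuation, drop corporate suffixes, collapse spaces."""
    name = str(name).strip().lower()
    name = re.sub(r'[^\w\s]', '', name)
    for suffix in ['inc', 'llc', 'ltd', 'co', 'corp']:
        name = re.sub(r'\b' + suffix + r'\b', '', name)
    name = re.sub(r'\s+', ' ', name)
    return name.strip()

# The curly apostrophe is punctuation, so it is removed
byjus = clean_startup_name('BYJU’S')

# Trailing 'Ltd.' is dropped; 'pvt' is not in the suffix list, so it stays
ola = clean_startup_name(' Ola Cabs Pvt. Ltd. ')

# A suffix in the middle of the name is removed without leaving a double space
acme = clean_startup_name('Acme Co. Labs')

print(byjus, '|', ola, '|', acme)
```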

In [69]:
def get_best_match(name, unique_names, threshold=90):
    """
    For a given name, find the best match from the list of unique names.
    If the similarity score exceeds the threshold, return the matching name;
    otherwise, return the original name.
    """

    # extractOne returns a (match, score, index) tuple, or None if nothing is found
    match = process.extractOne(name, unique_names, scorer=fuzz.token_sort_ratio)

    if match and match[1] >= threshold:
        return match[0]
    else:
        return name

In [70]:
funding_df['CleanedStartup'] = funding_df['Startup'].apply(clean_startup_name)
unique_clean_names = list(funding_df['CleanedStartup'].unique())

funding_df['StandardizedStartup'] = funding_df['CleanedStartup'].apply(
    lambda x: get_best_match(x, unique_clean_names)
)
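One caveat about this step: every cleaned name is itself a member of `unique_clean_names`, so the best candidate will generally be a perfect-score match. In practice this pass can then only merge names that `fuzz.token_sort_ratio` treats as identical, such as the same words in a different order. If the goal is also to merge near-duplicates like spelling variants, one option is to compare each name only against canonical names that have already been accepted. The sketch below illustrates that idea using the standard library's `difflib.SequenceMatcher` as a stand-in scorer instead of rapidfuzz. The function names, sample names, and threshold are assumptions for illustration, not the notebook's method.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def build_canonical_map(names, threshold=90):
    """Map each name to the first-seen canonical name that is similar enough."""
    canonical = []   # names accepted as canonical so far
    mapping = {}
    for name in names:
        best, best_score = None, 0.0
        for candidate in canonical:
            score = similarity(name, candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            mapping[name] = best        # close enough: reuse the canonical name
        else:
            canonical.append(name)      # otherwise this name becomes canonical
            mapping[name] = name
    return mapping

# Made-up names: 'flipkartt' is a spelling variant of 'flipkart'
sample_names = ['flipkart', 'flipkartt', 'ola cabs', 'swiggy']
name_map = build_canonical_map(sample_names)
print(name_map)
```

The result depends on input order, since the first spelling encountered becomes the canonical one. Sorting names by frequency beforehand is one way to make the most common spelling win.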

##### **5. Processing *Investors* Column**

In [71]:
def clean_investor_name(name):
    """
    Clean and normalize an investor name: lowercase it, strip punctuation,
    and collapse repeated whitespace.
    """
    
    name = str(name).strip().lower()
    name = re.sub(r'[^\w\s]', '', name)
    name = re.sub(r'\s+', ' ', name)

    return name.strip()

In [72]:
funding_df['Investors'] = funding_df['Investors'].fillna('')

# Split on commas, clean each name, and skip empty fragments (e.g. from a trailing comma)
funding_df['InvestorsCleaned'] = funding_df['Investors'].apply(
    func = lambda x: [clean_investor_name(inv) for inv in x.split(',') if inv.strip()]
)
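The splitting logic can be tried on single strings without a DataFrame. The sketch below restates `clean_investor_name` and wraps the split in a hypothetical helper, `split_investors`, which also drops empty fragments such as the one a trailing comma would produce. The first input is the Groww row from the preview; the trailing-comma input is a made-up edge case.

```python
import re

def clean_investor_name(name):
    """Lowercase, strip punctuation, and collapse repeated whitespace."""
    name = str(name).strip().lower()
    name = re.sub(r'[^\w\s]', '', name)
    name = re.sub(r'\s+', ' ', name)
    return name.strip()

def split_investors(raw):
    """Hypothetical helper: split on commas, clean, and drop empty fragments."""
    cleaned = (clean_investor_name(part) for part in raw.split(','))
    return [name for name in cleaned if name]

groww = split_investors('MC Global Edtech, B Capital, Baron, others')
trailing = split_investors('SoftBank Vision Fund 2,')
empty = split_investors('')
print(groww, trailing, empty)
```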

##### **6. Adding *Year* Column to the DataFrame**

In [73]:
funding_df['Year'] = funding_df.Date.dt.year
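Because *Date* has missing values (3210 non-null out of 3212 rows in the `info()` output), `dt.year` yields `NaN` for those rows and the column is stored as floats, for example `2021.0`. A minimal sketch on made-up dates, assuming whole-number years are preferred, of casting to pandas' nullable `Int64` dtype:

```python
import pandas as pd

# Made-up dates with one missing entry, mirroring the gaps in the Date column
dates = pd.to_datetime(pd.Series(['2021-04-14', None, '2020-01-09']))

# With a NaT present, the extracted years are float64 (2021.0, NaN, 2020.0)
years_float = dates.dt.year

# The nullable Int64 dtype keeps integers and represents the gap as <NA>
years_int = years_float.astype('Int64')
print(years_float.dtype, years_int.dtype)
```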

### **📄 Exporting Processed Data**

In [74]:
funding_df.to_csv(
    path_or_buf = '../Data/processed_indian_startup_funding.csv',
    index = False
)
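CSV has no list type, so the *InvestorsCleaned* lists are written as their string representation and come back as plain strings when the file is read again. The sketch below uses an in-memory buffer and two made-up rows (rather than the real file) to show one way of restoring the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, standing in for the processed data
toy_df = pd.DataFrame({
    'Startup': ['Groww', 'Beldara'],
    'InvestorsCleaned': [['b capital', 'baron'], []],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "['b capital', 'baron']"
first_cell_type = type(loaded['InvestorsCleaned'].iloc[0]).__name__

# literal_eval safely parses the string back into a Python list
restored = loaded['InvestorsCleaned'].apply(ast.literal_eval)
print(first_cell_type, restored.tolist())
```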

### **📄 Exporting Investors List**

In [75]:
pd.Series(
    data = sorted(set(funding_df.InvestorsCleaned.sum()))
).to_csv(
    path_or_buf = '../Data/investors.csv',
    index = False
)
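Calling `sum()` on a column of lists concatenates them one pair at a time, which works but can become slow as the number of rows grows. `Series.explode()` is one alternative that flattens the lists into one value per row. A sketch on a made-up frame, with investor names taken from the preview:

```python
import pandas as pd

# Toy list-valued column; the empty list mimics a row with no investors
toy_df = pd.DataFrame({
    'InvestorsCleaned': [
        ['b capital', 'baron'],
        [],
        ['softbank vision fund 2', 'baron'],
    ]
})

# explode() emits one row per list element; empty lists become NaN, so drop them
flat = toy_df['InvestorsCleaned'].explode().dropna()

# Deduplicate and sort to get the investor list
unique_investors = sorted(flat.unique())
print(unique_investors)
```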

In [76]:
inv_mask = funding_df.Investors.str.lower().str.contains('softbank', na=False)
filtered_df = funding_df[inv_mask]
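The same kind of lookup can be run against the cleaned investor lists instead of the raw string. The sketch below assumes a list-valued column like *InvestorsCleaned* and uses three rows based on the preview; `toy_df` and the variable names are illustrative.

```python
import pandas as pd

# Toy frame with already-cleaned investor lists (names taken from the preview)
toy_df = pd.DataFrame({
    'Startup': ['Meesho', 'Groww', 'Swiggy'],
    'InvestorsCleaned': [
        ['softbank vision fund 2'],
        ['b capital', 'baron'],
        ['amansa holdings', 'carmignac'],
    ],
})

keyword = 'softbank'

# True for rows where any individual investor name contains the keyword
mask = toy_df['InvestorsCleaned'].apply(
    lambda investors: any(keyword in inv for inv in investors)
)
softbank_startups = toy_df.loc[mask, 'Startup'].tolist()
print(softbank_startups)
```

Working on the list also makes exact matches possible, for example `'baron' in investors`, which a substring search on the raw string cannot guarantee.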