**Abhishek Choudhary**

Imports:

instaloader: This is the library used to scrape data from Instagram profiles.

pandas: Used for handling and manipulating CSV data.

time: Used to add a delay between requests to avoid rate-limiting issues.

re: The regular expression module, used to remove URLs from each bio and to find email addresses in it.

logging: Used for logging errors encountered during scraping.

datetime: Used to append a timestamp to the output CSV filename.

In [1]:
import instaloader
import pandas as pd
import time
import re
import logging
import datetime

Instaloader Initialization:

We create an instance of the Instaloader class, which is responsible for downloading Instagram profile data.



In [2]:
L = instaloader.Instaloader()

Logging Setup:

We configure logging to write error messages to the scraper_errors.log file. This will help track any issues encountered during scraping, such as private profiles or invalid data.

Loading Usernames from CSV:

We load the CSV file containing Instagram usernames using pandas.read_csv().

We then extract the usernames from the "User Profile" column and convert them into a list for processing.



In [3]:
logging.basicConfig(filename='scraper_errors.log', level=logging.ERROR)
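With only a filename and a level, basicConfig writes records as LEVEL:logger:message with no timestamp. The following is a minimal sketch, assuming you want each logged failure to carry the time it occurred. The helper `build_error_logger`, the logger name `scraper_demo`, and the temporary log path are illustrative and not part of the notebook.

```python
import logging
import os
import tempfile

def build_error_logger(log_path):
    """Return a logger that appends timestamped ERROR records to log_path."""
    logger = logging.getLogger("scraper_demo")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep demo records out of the root logger
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    # Replace existing handlers so re-running the cell does not duplicate lines
    logger.handlers = [handler]
    return logger

# A temporary directory stands in for the notebook's working directory
log_path = os.path.join(tempfile.mkdtemp(), "scraper_errors.log")
logger = build_error_logger(log_path)

logger.error("Error scraping %s: %s", "some_user", "profile does not exist")
logger.warning("below ERROR level, so this is not written")

with open(log_path, encoding="utf-8") as fh:
    log_text = fh.read()
print(log_text)
```

Only the ERROR record reaches the file, which is the same filtering the notebook's level=logging.ERROR setting applies.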

In [None]:
df_input = pd.read_csv('C:/Users/hp/Desktop/PW assesment/instagram_profiles_sample - Instagram.csv')  # Replace with the path to your CSV file
usernames = df_input['User Profile'].tolist()
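A spreadsheet column often contains blank cells, stray spaces, a leading "@", or repeated names, and each of those would cost a wasted request later. The sketch below shows one way to tidy the list returned by tolist() before scraping. The helper `normalize_usernames` and the sample values are assumptions for illustration, not something the notebook defines.

```python
import math

def normalize_usernames(raw_values):
    """Clean a list of raw cell values into unique usernames, keeping order."""
    seen = set()
    cleaned = []
    for value in raw_values:
        # pandas represents empty cells as float NaN; skip those and None
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        name = str(value).strip().lstrip("@")
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

# Made-up input that mimics a messy 'User Profile' column
raw = ["virat.kohli", " @bhuvan.bam22 ", float("nan"), "", "virat.kohli"]
print(normalize_usernames(raw))
```

In the notebook this would be applied as `usernames = normalize_usernames(usernames)` right after the list is built.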

Regex Patterns:

url_pattern: This regex is used to match any URLs in the Instagram bio (both HTTP and HTTPS).

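email_pattern: This regex matches text shaped like an email address (a local part, "@", a domain, and a top-level domain of at least two letters). It does not check that the address actually exists.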

In [None]:

url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
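To see what the two patterns do, it helps to run them on a single string. This is a small sketch using a made-up bio (`sample_bio` and the example.com addresses are placeholders). The patterns are redefined here, with the character class written as `[!*\(\),]`, so the snippet runs on its own.

```python
import re

url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Hypothetical bio containing one email address and one link
sample_bio = "Actor | Bookings: team@example.com | https://example.com/shop"

# Step 1: delete every URL from the bio
clean_bio = re.sub(url_pattern, '', sample_bio)

# Step 2: collect everything that looks like an email address
emails = re.findall(email_pattern, clean_bio)

print(repr(clean_bio))
print(emails)
```

The URL match stops at the first character outside the allowed set (here the end of the string), so the text around the link is left in place, including the trailing separator.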

In [7]:
profile_data = []

### **Instagram Profile Scraping**

This code scrapes Instagram profiles by iterating over a list of usernames:
1. **Fetch Profile**: Extracts profile details using Instaloader.
2. **Clean Bio**: Removes URLs from the bio and checks for an email address.
3. **Store Data**: Saves data (username, bio, followers, following, post count, email, status) into a dictionary.
4. **Error Handling**: Logs any Instaloader or unexpected exception and fills every field of that row with "Error".
5. **Rate Limiting**: Adds a 2-second delay to avoid rate-limiting issues.

The data is then appended to a list for further processing.
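The control flow of the steps above can be sketched without touching Instagram at all. In this hedged example, `fake_fetch` stands in for the Instaloader call, and `error_row` and `scrape_all` are hypothetical helpers. The sketch only illustrates the pattern of one row per username, an "Error" row on failure, and a pause after every request.

```python
import time

FIELDS = ['Bio', 'Followers', 'Following', 'Post Count', 'Email']

def error_row(username):
    """Build the placeholder row used when a profile cannot be scraped."""
    row = {'Username': username}
    row.update({field: 'Error' for field in FIELDS})
    row['Status'] = 'Error'
    return row

def scrape_all(usernames, fetch, delay_seconds=2.0):
    """Call fetch(username) for each name and record a row either way."""
    rows = []
    for username in usernames:
        try:
            row = dict(fetch(username))
            row['Status'] = 'Success'
            rows.append(row)
        except Exception:
            rows.append(error_row(username))
        time.sleep(delay_seconds)  # pause between requests
    return rows

def fake_fetch(username):
    """Stand-in for the real profile lookup; fails for one made-up name."""
    if username == 'missing_user':
        raise LookupError('profile does not exist')
    return {'Username': username, 'Bio': 'hello', 'Followers': 10,
            'Following': 5, 'Post Count': 3, 'Email': 'no mail found'}

# delay_seconds=0 keeps the demo fast; the notebook uses 2 seconds
rows = scrape_all(['alice', 'missing_user'], fake_fetch, delay_seconds=0)
for row in rows:
    print(row['Username'], row['Status'])
```

Building the error row in one helper keeps both rows on the same columns, so the later DataFrame has a consistent shape.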


In [None]:
for username in usernames:
    try:
        # Fetch the profile metadata for this username
        profile = instaloader.Profile.from_username(L.context, username)

        # Remove URLs from the bio before searching it for an email address
        bio = profile.biography
        clean_bio = re.sub(url_pattern, '', bio)

        # Keep the first email found, or a placeholder when there is none
        email = "no mail found"
        email_match = re.findall(email_pattern, clean_bio)
        if email_match:
            email = email_match[0]

        data = {
            'Username': profile.username,
            'Bio': clean_bio,
            'Followers': profile.followers,
            'Following': profile.followees,
            'Post Count': profile.mediacount,
            'Email': email,
            'Status': 'Success'
        }
        profile_data.append(data)
        print(f"Successfully scraped data for {username}")

    except instaloader.exceptions.InstaloaderException as e:
        # Errors raised by Instaloader itself (for example, a profile that does not exist)
        print(f"Error scraping {username}: {e}")
        logging.error(f"Error scraping {username}: {e}")
        profile_data.append({'Username': username, 'Bio': 'Error', 'Followers': 'Error', 'Following': 'Error', 'Post Count': 'Error', 'Email': 'Error', 'Status': 'Error'})

    except Exception as e:
        # Anything else, so that one bad username does not stop the whole run
        print(f"Unexpected error with {username}: {e}")
        logging.error(f"Unexpected error with {username}: {e}")
        profile_data.append({'Username': username, 'Bio': 'Error', 'Followers': 'Error', 'Following': 'Error', 'Post Count': 'Error', 'Email': 'Error', 'Status': 'Error'})

    # Add delay to prevent rate-limiting issues
    time.sleep(2)

Successfully scraped data for virat.kohli
Successfully scraped data for bhuvan.bam22
Successfully scraped data for rashmika_mandanna
Successfully scraped data for dqsalmaan
Successfully scraped data for deepikapadukone
Successfully scraped data for ranveersingh
Successfully scraped data for aliaabhatt
Successfully scraped data for akshaykumar
Successfully scraped data for katrinakaif
Successfully scraped data for shraddhakapoor
Successfully scraped data for anushkasharma
Successfully scraped data for sidmalhotra
Successfully scraped data for kartikaaryan
Successfully scraped data for priyankachopra
Successfully scraped data for jacquelienefernandez
Successfully scraped data for sonamkapoor
Successfully scraped data for parineetichopra
Successfully scraped data for taapsee
Successfully scraped data for sushmitasen47
Successfully scraped data for thejohnabraham
Successfully scraped data for krystledsouza
Successfully scraped data for dishapatani
Successfully scraped data for athiyashetty
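Every profile succeeded in this run, but on a longer list it is easier to count outcomes than to read the printed lines. This is a small sketch that tallies the Status field. `summarize_status` is a hypothetical helper, and `sample_rows` is made-up data shaped like the rows the loop collects.

```python
from collections import Counter

def summarize_status(rows):
    """Count rows per Status and list the usernames that failed."""
    counts = Counter(row['Status'] for row in rows)
    failed = [row['Username'] for row in rows if row['Status'] == 'Error']
    return counts, failed

# Made-up rows; real rows also carry Bio, Followers, and so on
sample_rows = [
    {'Username': 'alice', 'Status': 'Success'},
    {'Username': 'bob', 'Status': 'Error'},
    {'Username': 'carol', 'Status': 'Success'},
]

counts, failed = summarize_status(sample_rows)
print(f"{counts['Success']} succeeded, {counts['Error']} failed")
print("Retry candidates:", failed)
```

In the notebook the same call would take `profile_data` as its argument.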

Convert to DataFrame:

We convert the profile_data list (which contains all the scraped data) into a pandas DataFrame for easy manipulation and saving.

In [9]:
df = pd.DataFrame(profile_data)

In [12]:
df.head(5)

,Username,Bio,Followers,Following,Post Count,Email,Status
0,virat.kohli,Carpediem!,271677662,275,1027,no mail found,Success
1,bhuvan.bam22,New BB Ki Vines Episode⬇️,20564992,278,1702,no mail found,Success
2,rashmika_mandanna,Kindness before all 🌻🧡✨,46043119,327,763,no mail found,Success
3,dqsalmaan,"Film, Business, Auto Enthusiast",15101211,972,765,no mail found,Success
4,deepikapadukone,Feed.Burp.Sleep.Repeat.,80285648,214,624,no mail found,Success


In [None]:

timestamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
csv_filename = f'instagram_profiles112_{timestamp}.csv'

In [11]:
df.to_csv(csv_filename, index=False)
print(f"Data saved to '{csv_filename}'")

Data saved to 'instagram_profiles112_2025-05-05_02-47-07.csv'


Save Data to CSV:

We save the DataFrame to a CSV file with a timestamp in the filename to prevent overwriting previous files.

index=False ensures that row indices are not saved in the CSV file.
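The filename logic can be checked in isolation by passing in a fixed time. This is a minimal sketch: `timestamped_filename` is a hypothetical helper, and the fixed datetime is chosen to reproduce the filename printed above.

```python
import datetime

def timestamped_filename(prefix, now=None):
    """Return '<prefix>_<YYYY-MM-DD_HH-MM-SS>.csv' for the given moment."""
    now = now or datetime.datetime.now()
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H-%M-%S')}.csv"

# Fixed moment so the result is reproducible
fixed = datetime.datetime(2025, 5, 5, 2, 47, 7)
name = timestamped_filename('instagram_profiles112', fixed)
print(name)
```

The timestamp has one-second resolution, so two saves within the same second would still write to the same file.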

