
# **Collect Text Data from Local Folder (.txt files)**

In [37]:
# ===============================================
# 1. Collect Text Data from Local Folder (.txt files)
# ===============================================

import os
import pandas as pd
from google.colab import files  # For uploading files in Colab

# ✅ Step 1: Upload multiple .txt files
# You can select multiple files when the upload window opens
uploaded = files.upload()

# ✅ Step 2: Create a folder to store uploaded text files
os.makedirs("text_data", exist_ok=True)

# Save uploaded files into the "text_data" folder
for filename, file_content in uploaded.items():
    with open(os.path.join("text_data", filename), "wb") as f:
        f.write(file_content)

# ✅ Step 3: Function to read all .txt files from the folder
def collect_local_text(folder_path="text_data"):
    collected_texts = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):  # Only process .txt files
            filepath = os.path.join(folder_path, filename)
            # "utf-8-sig" reads plain UTF-8 and also drops a leading byte order mark (BOM) if present
            with open(filepath, "r", encoding="utf-8-sig") as file:
                text = file.read()  # Read entire file
                collected_texts.append({"source": filename, "text": text})

                # Print each file content clearly
                print(f"\n📄 File: {filename}")
                print("-------------------------------------------------")
                print(text)
                print("-------------------------------------------------")
    return pd.DataFrame(collected_texts)

# ✅ Step 4: Collect all uploaded text files into a DataFrame
local_text_df = collect_local_text()

print("\n📂 Local Text DataFrame:\n", local_text_df)


Saving data Assebling1.txt to data Assebling1.txt
Saving data Assebling.txt to data Assebling.txt

📄 File: data Assebling.txt
-------------------------------------------------
Overview of Data Assembling
Data assembling primarily refers to gathering diverse data forms, especially unstructured text data, and converting it into structured formats ready for analysis. This process includes steps such as data collection, cleaning, integration, transformation, and storage. Proper data assembling is critical because quality input data directly impacts the accuracy and usability of analytic outcomes.
Techniques and Methods in Data Assembling for Text Analytics
1. Text Preprocessing
Text preprocessing is essential to handle raw text which is often noisy. It includes:
   * Tokenization: Splitting text into words or phrases
   * Stop word removal: Eliminating common but uninformative words (e.g., "the", "and")
   * Stemming and Lemmatization: Reducing words to their root forms
   * Removing punc
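
The tokenization and stop word removal steps listed in the file shown above can be sketched in plain Python. This is a minimal illustration, not the notebook's own preprocessing: the `STOP_WORDS` set, the helper names, and the sample sentence are assumptions made for the example, and a real pipeline would likely use a fuller stop word list from an NLP library.

```python
import re

# Illustrative stop word list (an assumption for this sketch, not a standard list)
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "it", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set, keeping the original order."""
    return [token for token in tokens if token not in stop_words]


sample = "The quality of the input data impacts the accuracy of analytic outcomes."
tokens = remove_stop_words(tokenize(sample))
print(tokens)
```

Because the regular expression only matches letters and digits, punctuation is discarded as a side effect of tokenizing.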

In [None]:
# # ===============================================
# # Scraping BBC News Headlines
# # ===============================================

# import requests
# from bs4 import BeautifulSoup
# import pandas as pd

# # BBC News URL
# url = "https://www.bbc.com/news"

# # Send HTTP request
# response = requests.get(url)
# soup = BeautifulSoup(response.text, "html.parser")

# # Find headline sections
# headlines = soup.find_all("h2")

# # Collect news data
# news_data = []
# for headline in headlines:
#     text = headline.get_text(strip=True)
#     link = headline.find_parent("a")["href"] if headline.find_parent("a") else None
#     if link and not link.startswith("http"):  # Fix relative links
#         link = "https://www.bbc.com" + link
#     news_data.append({"headline": text, "link": link})

# # Convert to DataFrame
# news_df = pd.DataFrame(news_data).dropna()
# print(news_df.head(20))


# **Scraping BBC News Headlines + Full Article Text**

In [38]:
# ===============================================
# Scraping BBC News Headlines + Full Article Text
# ===============================================

import requests
from bs4 import BeautifulSoup
import pandas as pd

# BBC News URL
url = "https://www.bbc.com/news"

# Step 1: Request homepage
response = requests.get(url, timeout=10)  # timeout avoids hanging on a stalled connection
soup = BeautifulSoup(response.text, "html.parser")

# Step 2: Collect headlines + links
news_data = []
headlines = soup.find_all("h2")

for headline in headlines:
    text = headline.get_text(strip=True)
    parent_link = headline.find_parent("a")  # look up the enclosing <a> once
    link = parent_link.get("href") if parent_link else None  # .get avoids a KeyError when href is missing
    if link and not link.startswith("http"):  # Fix relative links
        link = "https://www.bbc.com" + link

    if link:
        # Step 3: Request each article page
        try:
            article_resp = requests.get(link, timeout=10)
            article_soup = BeautifulSoup(article_resp.text, "html.parser")

            # Step 4: Extract text from <p> tags
            paragraphs = [p.get_text(strip=True) for p in article_soup.find_all("p")]
            article_text = " ".join(paragraphs)

        except Exception as e:
            article_text = f"Error fetching article: {e}"
        # Step 5: Store data
        news_data.append({
            "headline": text,
            "link": link,
            "article_text": article_text
        })
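
The relative-link fix in the loop above can also be written with the standard library's `urljoin`, which handles both relative and absolute hrefs. The sketch below is an optional variation rather than the notebook's method: `normalize_links` and `BASE_URL` are hypothetical names, and dropping duplicate links is an added assumption so the same article is not requested twice.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"  # same site root the loop above prepends


def normalize_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping empty values and duplicates.

    The first-seen order of the links is preserved.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # headline had no enclosing link
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links([
    "/news/articles/c1eddw19042o",
    "https://www.bbc.com/news/articles/c1eddw19042o",
    None,
    "/news/articles/ckgee0x9p40o",
])
print(links)
```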

In [None]:
# Step 6: Convert to DataFrame
news_df = pd.DataFrame(news_data).dropna()
print(news_df.head(3))  # Show first 3 rows

                                            headline  \
0  UK warns Israel not to retaliate against Pales...   
1  Trump hails Charlie Kirk as 'American hero' as...   
2  Outdoor brand Arc'teryx apologises for firewor...   

                                             link  \
0  https://www.bbc.com/news/articles/c1wggrdn9dno   
1  https://www.bbc.com/news/articles/ckgee0x9p40o   
2  https://www.bbc.com/news/articles/c1eddw19042o   

                                        article_text  
0  Foreign Secretary Yvette Cooper says she has w...  
1  US President Donald Trump hailed the conservat...  
2  Chinese officials are investigating outdoor cl...  


# **Provide a Link and Get the Article**

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC article to scrape
url = "https://www.bbc.com/news/articles/c1eddw19042o"

# Send an HTTP GET request to the article page
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")

    # Extract the article headline (usually in <h1>)
    headline = soup.find('h1').get_text(strip=True) if soup.find('h1') else 'No headline found'

    # Extract paragraphs from the article body - BBC often uses data-component="text-block"
    paragraphs = soup.find_all(attrs={"data-component": "text-block"})

    # Combine all paragraph texts into one string; the " " separator keeps text
    # from nested tags inside a block from being fused together
    article_text = "\n".join([p.get_text(" ", strip=True) for p in paragraphs])

    # Print extracted headline and article text
    print("Headline:")
    print(headline)
    print("\nArticle Text:")
    print(article_text)
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")


Headline:
Outdoor brand Arc'teryx apologises for fireworks display in Tibet

Article Text:
Chinese officials are investigating outdoor clothing brand Arc'teryx after it apologised for a fireworks display in the Himalayan region of Tibet, which drew backlash for its potential impact on the fragile ecosystem.Videos from the 19 September event show multi-coloured fireworks erupting across foothills in a display designed by Chinese artist Cai Guo Qiang as part of a promotional campaign.But the show sparked a barrage of criticism online, with people saying the stunt contradicts Arc'teryx's image as a conservation-focused brand and calling for a boycott of its clothing line.The Canadian firm apologised for the display, saying it was "out of line with Arc'teryx's values".
The firm said that it will work with an external agency to assess the project's impact, adding that it had used entirely biodegradable materials. Arc'teryx also said that the spectacle was aimed at raising awareness of mount

# **Fetch Emotional & Deep Quotes using ZenQuotes API**

In [None]:
!pip install certifi --upgrade




In [39]:
import requests
import pandas as pd

url = "https://zenquotes.io/api/quotes"

response = requests.get(url, timeout=10)

if response.status_code == 200:
    data = response.json()
    quotes_data = []
    for q in data[:20]:  # first 20 quotes
        quotes_data.append({
            "content": q["q"],  # quote text
            "author": q["a"]    # author
        })
else:
    quotes_data = []
    print("Error:", response.status_code)

quotes_df = pd.DataFrame(quotes_data)
print(quotes_df.head(10))


                                             content                   author
0  You don't have to control your thoughts; you j...              Dan Millman
1     Enjoy life. There's plenty of time to be dead.  Hans Christian Andersen
2             All great truths begin as blasphemies.      George Bernard Shaw
3  The worst part of success is trying to find so...             Bette Midler
4  Having lots of money while not having inner pe...    Paramahansa Yogananda
5                    Common sense is not so common.                  Voltaire
6               We forge the chains we wear in life.          Charles Dickens
7    Nothing external to you has any power over you.      Ralph Waldo Emerson
8  So we beat on, boats against the current, born...      F. Scott Fitzgerald
9  The more you know, the more you know you don't...                Aristotle
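
The loop above indexes `q["q"]` and `q["a"]` directly, which raises an error if an entry in the response is missing a field. A more defensive parser might look like the sketch below. `parse_quotes` is a hypothetical helper, the sample payload is hand-written from two rows of the output above plus two deliberately malformed entries, and the only thing assumed about the API is the `q` and `a` keys already used in the cell.

```python
def parse_quotes(payload, limit=20):
    """Turn a decoded JSON payload into rows with `content` and `author` keys.

    Entries that are not dicts, or that lack a non-empty string under
    "q" or "a", are skipped instead of raising an exception.
    """
    rows = []
    if not isinstance(payload, list):
        return rows
    for item in payload:
        if len(rows) >= limit:
            break
        if not isinstance(item, dict):
            continue
        quote = item.get("q")
        author = item.get("a")
        if isinstance(quote, str) and isinstance(author, str) and quote.strip() and author.strip():
            rows.append({"content": quote.strip(), "author": author.strip()})
    return rows


# Hand-written sample payload; the last two entries are intentionally malformed
sample_payload = [
    {"q": "Common sense is not so common.", "a": "Voltaire"},
    {"q": "All great truths begin as blasphemies.", "a": "George Bernard Shaw"},
    {"a": "Author without a quote"},
    "not a dict",
]
quotes_data = parse_quotes(sample_payload)
print(quotes_data)
```

The resulting list can be passed to `pd.DataFrame(quotes_data)` exactly as in the cell above.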


# **Combine All Sources into ONE DataFrame**

In [41]:
# ===============================================
# 4. Combine All Sources into ONE DataFrame
# ===============================================
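# Note: the three frames use different column names (source/text,
# headline/link/article_text, content/author), so concat produces the union
# of all columns and fills NaN wherever a source lacks a column.
combined_df = pd.concat([local_text_df, news_df, quotes_df], ignore_index=True)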

# ✅ Show first 30 assembled texts
pd.set_option("display.max_colwidth", None)  # show full text
print(combined_df.head(30))



                 source  \
0    data Assebling.txt   
1   data Assebling1.txt   
2                   NaN   
3                   NaN   
4                   NaN   
5                   NaN   
6                   NaN   
7                   NaN   
8                   NaN   
9                   NaN   
10                  NaN   
11                  NaN   
12                  NaN   
13                  NaN   
14                  NaN   
15                  NaN   
16                  NaN   
17                  NaN   
18                  NaN   
19                  NaN   
20                  NaN   
21                  NaN   
22                  NaN   
23                  NaN   
24                  NaN   
25                  NaN   
26                  NaN   
27                  NaN   
28                  NaN   
29                  NaN   
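
The `NaN` values in the `source` column above appear because each source frame has its own column names. One possible way to get a single text column is to map every frame onto a shared schema before concatenating. The sketch below is an assumption-laden illustration, not the notebook's code: `to_common_schema`, the `origin` column, and the choice of `link` and `author` as the `source` value are all hypothetical, and the three small frames are stand-ins for the real ones.

```python
import pandas as pd

# Small stand-ins for the three frames built earlier in the notebook
local_text_df = pd.DataFrame([{"source": "a.txt", "text": "local text"}])
news_df = pd.DataFrame(
    [{"headline": "H", "link": "https://example.com/x", "article_text": "news text"}]
)
quotes_df = pd.DataFrame([{"content": "quote text", "author": "Someone"}])


def to_common_schema(df, text_col, source_col, origin):
    """Return a new frame with only `source`, `text`, and `origin` columns."""
    if df.empty:
        # e.g. a failed API call produced no rows and therefore no columns
        return pd.DataFrame(columns=["source", "text", "origin"])
    out = df.rename(columns={text_col: "text", source_col: "source"})
    out = out[["source", "text"]].copy()
    out["origin"] = origin  # records which collector produced the row
    return out


combined_df = pd.concat(
    [
        to_common_schema(local_text_df, "text", "source", "local_file"),
        to_common_schema(news_df, "article_text", "link", "bbc_news"),
        to_common_schema(quotes_df, "content", "author", "zenquotes"),
    ],
    ignore_index=True,
)
print(combined_df)
```

With this layout every row has a value in `text`, and `origin` makes it possible to filter by collector later. Note that the news headline is dropped in this sketch; it could be kept as an extra column if needed.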

                                                                                                                                                                  

# **Save Assembled Data**

In [42]:
# ===============================================
# 5. Save Assembled Data
# ===============================================
combined_df.to_csv("assembled_text_data.csv", index=False, encoding="utf-8")
print("✅ All data saved to assembled_text_data.csv")

✅ All data saved to assembled_text_data.csv
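
To check that the saved file can be read back without losing text, a quick round trip can help. The sketch below uses a small stand-in frame and a temporary directory rather than the real `combined_df`, so the rows and the `round_trip_ok` name are illustrative assumptions. It includes a quoted string and an embedded newline, since scraped article text often contains both.

```python
import os
import tempfile

import pandas as pd

# Stand-in for combined_df; the rows are illustrative only
df = pd.DataFrame(
    {
        "source": ["a.txt", "b.txt"],
        "text": ['He said "hi", then left.', "line one\nline two"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "assembled_text_data.csv")
    df.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

# Compare cell values after the write/read round trip
round_trip_ok = restored.values.tolist() == df.values.tolist()
print(round_trip_ok)
```

One caveat when applying this to real data: empty strings are read back as `NaN` by default, so a frame with empty text cells would not compare equal after the round trip.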
