# Acquiring the Data
<div class="alert alert-block alert-danger">
<b>Disclaimer:</b>

The provided scripts are raw and exploratory in nature. The research questions and objectives were iteratively refined throughout the project as data availability improved and our understanding of the dataset deepened.
</div>

---
## Data Acquisition and Construction of Attention Indexes

This study begins by constructing six thematic Attention Indexes from Google Trends data, each reflecting a distinct retail investor focus in the Taiwanese market: ETFs, individual stocks, dividends, macro-sensitive sectors, technology stocks, and beginner-friendly investments. For each theme, five related keywords were selected and queried through the pytrends library. Each keyword's 2024 search-interest series was standardized (z-scored), and the standardized series within a theme were averaged to form a composite weekly index that quantifies shifts in public interest. These indexes serve as behavioral indicators capturing attention dynamics across different investment mindsets.

To align investor attention with actual market activity, we retrieved daily trading volumes for 19 representative TWSE-listed tickers using the yfinance library and summed them into weekly totals. These stocks were chosen for their relevance to the attention themes, ensuring consistency between behavioral and market-based data. Weekly volumes were z-scored per ticker, and the resulting dataset was merged with the attention indexes on their shared weekly dates to create a unified panel.
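The standardize-then-average step described above can be illustrated on toy numbers. This is a minimal sketch, not the notebook's actual data: the keywords `kw_a` and `kw_b`, their values, and the helper `composite_index` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Synthetic weekly search-interest values for two hypothetical keywords
toy = pd.DataFrame(
    {"kw_a": [10, 20, 30, 40], "kw_b": [5, 5, 15, 35]},
    index=pd.date_range("2024-01-07", periods=4, freq="W-SUN"),
)

def composite_index(df: pd.DataFrame) -> pd.Series:
    """Z-score each keyword column, then average across keywords per week."""
    z = (df - df.mean()) / df.std()  # pandas uses the sample std (ddof=1)
    return z.mean(axis=1)

toy_index = composite_index(toy)
print(toy_index)
```

Because every column is centered before averaging, the composite index is centered on zero as well, so its values read as "above or below the year's typical attention" rather than as raw search volume.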

This dataset forms the basis for addressing our core research questions:

1. Do changes in public attention precede movements in trading activity (RQ1)?

2. Can attention indexes improve short-term predictive models (RQ2)?

3. Do major external events—such as U.S. Fed meetings or visits by industry leaders—simultaneously affect both investor attention and market engagement (RQ3)?
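
One simple way to look at RQ1 is a lagged correlation: compare attention in week t-k with volume in week t for several values of k. The sketch below runs on synthetic data in which volume is deliberately built to trail attention by two weeks. The function name `lead_lag_correlations` and all inputs are assumptions for illustration, not the analysis actually used in this project.

```python
import numpy as np
import pandas as pd

def lead_lag_correlations(attention: pd.Series, volume: pd.Series,
                          max_lag: int = 4) -> pd.Series:
    """Correlation of attention at week t-k with volume at week t, k = 0..max_lag."""
    return pd.Series(
        {k: attention.shift(k).corr(volume) for k in range(max_lag + 1)},
        name="corr",
    )

# Synthetic example: volume follows attention with a two-week delay plus noise
rng = np.random.default_rng(0)
weeks = pd.date_range("2024-01-07", periods=52, freq="W-SUN")
attention = pd.Series(rng.normal(size=52), index=weeks)
volume = attention.shift(2) + 0.1 * pd.Series(rng.normal(size=52), index=weeks)

corrs = lead_lag_correlations(attention, volume)
best_lag = int(corrs.idxmax())  # lag with the strongest positive correlation
print(corrs.round(3))
print("best lag:", best_lag)
```

A peak at a positive lag is consistent with attention preceding trading activity, but on real data it does not by itself establish that attention drives volume.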

---

In [1]:
# If pytrends or yfinance is not installed yet, install them first
# (openpyxl is needed by pandas to read and write .xlsx files)
#!pip install pytrends yfinance openpyxl
import pandas as pd
from pytrends.request import TrendReq
import time
import yfinance as yf
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Building Subgroup Attention Indexes Using Google Trends

<div class="alert alert-block alert-danger">
<b>Warning:</b>

The following code (kept in a markdown cell so that it is not run by accident) may fail if it is run too many times, because Google Trends rate-limits requests per IP address. I could not re-run this exact code to acquire the data again, possibly because my IP address was blocked by Google Trends. You may still be able to get the data if you space out your requests.
</div>
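
One common way to be "careful with the process" is to retry a failed request with exponentially growing pauses. The helper below is a hypothetical sketch, not part of pytrends: `fetch_with_backoff`, its parameters, and the fake `flaky_fetch` function are all assumptions. The demo injects a recording function in place of `time.sleep` so that it runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def fetch_with_backoff(fetch: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")

# Demo with a fake request that fails twice before succeeding
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

waits = []  # records the requested pauses instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

In this notebook, `fetch` could be a small function that calls `pytrends.build_payload(...)` and returns `pytrends.interest_over_time()`. Whether a longer pause is enough to lift a block on Google's side is not guaranteed.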

This section of the research builds a comprehensive picture of retail investor attention in Taiwan by analyzing search behavior from Google Trends. Instead of relying on a single keyword, we group related search terms into thematic clusters—such as ETFs, dividends, macroeconomics, and beginner investing—and create composite "attention indexes" that represent different investor mindsets. These indexes serve as behavioral signals that we can later compare to actual trading activity, test for predictability, and observe under macroeconomic shocks. By capturing multiple dimensions of attention, we aim to better understand how public interest reflects or influences financial market behavior.

```
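# Initialize pytrends. tz is the offset from UTC in minutes under Google's sign
# convention: 360 is US Central time; Taiwan (UTC+8) would be -480.
pytrends = TrendReq(hl='zh-TW', tz=360)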

# Define keyword subgroups
subgroups = {
    "ETF_Attention_Index": ['ETF 投資', '0050', '高股息 ETF', '00878', 'ETF 定期定額'],
    "Stock_Attention_Index": ['投資 股票', '台股 投資', '2330', '台積電', '當沖'],
    "Dividend_Attention_Index": ['高股息', '殖利率', '存股', '金融股', '配息'],
    "Beginner_Attention_Index": ['股票是什麼', '怎麼投資', '證券開戶', '股市新手', '股票入門'],
    "Macro_Attention_Index": ['升息', '通膨', '美國股市', 'FED', '經濟衰退'],
    "Tech_Attention_Index": ['半導體', '台積電', 'AI 投資', '高科技股', 'IC 設計']
}

# Timeframe and location
timeframe = '2024-01-01 2024-12-31'
geo = 'TW'

# Container for results
index_dfs = []

# Loop with 5-second delay
for index_name, keyword_list in subgroups.items():
    try:
        print(f"Fetching: {index_name}...")
        pytrends.build_payload(keyword_list, timeframe=timeframe, geo=geo)
        time.sleep(5)  # Delay to avoid 429 rate limit
        
        df = pytrends.interest_over_time().drop(columns='isPartial')
        df.columns = [col.replace(" ", "_") for col in df.columns]
        
        # Normalize
        df_norm = (df - df.mean()) / df.std()
        df_norm[index_name] = df_norm.mean(axis=1)
        
        index_dfs.append(df_norm[[index_name]])
    except Exception as e:
        print(f"Failed to fetch {index_name}: {e}")
        continue

# Merge all into one DataFrame
attention_index_df = pd.concat(index_dfs, axis=1)

# Save to Excel (pandas needs openpyxl for .xlsx files)
attention_index_df.to_excel('attention_index_data.xlsx')

# Show preview (only the last expression in a cell is displayed)
attention_index_df.head()
```

<div class="alert alert-warning">
<b>Message:</b> 
    
In case the code does not run successfully, the acquired data is available at the link below. I will keep it online at least until the end of the Spring 2025 semester. If I do remove it, I will put the `.csv` file in my repository.

</div>

Here is the link: [https://docs.google.com/spreadsheets/d/1TDK94m3D_oqx_hV-NZ5SwGBJXWGo9XmR/edit?usp=sharing&ouid=103068230126415922496&rtpof=true&sd=true](https://docs.google.com/spreadsheets/d/1TDK94m3D_oqx_hV-NZ5SwGBJXWGo9XmR/edit?usp=sharing&ouid=103068230126415922496&rtpof=true&sd=true)

<div class="alert alert-block alert-danger">
<b>Warning:</b>

Place the `attention_index_data.xlsx` file in the same folder as this notebook before running the cell below.
</div>

In [2]:
attention_index_df = pd.read_excel('attention_index_data.xlsx', index_col=0)

## Merging Weekly Market Volume with Attention Indexes

This step connects behavioral data with actual market behavior. By combining Google Trends-based attention indexes with real-world trading volume, we create a unified dataset that allows us to explore how investor interest aligns with or influences financial activity. This merged view enables descriptive comparisons (e.g., trend co-movement), statistical correlation analysis (RQ1), and predictive modeling (RQ2). It also allows us to examine whether external events like Fed announcements shift both attention and market engagement (RQ3). Aligning these time series on a weekly basis ensures consistency and comparability across all variables.
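
Weekly alignment depends on which date each source uses to label a week, so it is worth checking before merging. The sketch below uses synthetic daily data to show how pandas labels `'W-SUN'` bins, plus one possible relabeling. It assumes, without confirmation from this notebook, that the other series might stamp each week with its first day; verify this against the downloaded Google Trends file.

```python
import pandas as pd

# Synthetic daily series: one unit of volume per business day in January 2024
days = pd.bdate_range("2024-01-01", "2024-01-31")
daily = pd.Series(1, index=days)

# 'W-SUN' bins run Monday through Sunday and are labeled by the closing Sunday
week_end_labeled = daily.resample("W-SUN").sum()

# Alternative (assumption: the other series stamps weeks with their first day):
# bins run Sunday through Saturday, relabeled with the opening Sunday
week_start_labeled = daily.resample("W-SAT").sum()
week_start_labeled.index = week_start_labeled.index - pd.Timedelta(days=6)

print(week_end_labeled.head(2))
print(week_start_labeled.head(2))
```

Under the first labeling, the row dated 2024-01-07 holds trading from January 1 to January 7. Under the second, the row with the same date holds trading from January 7 to January 13. A one-week mismatch of this kind would shift any lead-lag result for RQ1 by one week.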

### Mapping Attention Indexes to Representative TWSE Stocks

To ensure that our stock universe reflects the themes captured by each attention index, we selected representative TWSE stocks for each attention category:

| Attention Index             | Keywords                                                                 | Matched Tickers                                          |
|----------------------------|--------------------------------------------------------------------------|----------------------------------------------------------|
| **ETF_Attention_Index**     | "ETF 投資", "0050", "高股息 ETF", "00878", "ETF 定期定額"                 | "0050.TW", "006208.TW", "00878.TW", "00713.TW"            |
| **Stock_Attention_Index**   | "投資 股票", "台股 投資", "2330", "台積電", "當沖"                         | "2330.TW", "2303.TW", "2412.TW", "3008.TW"                |
| **Dividend_Attention_Index**| "高股息", "殖利率", "存股", "金融股", "配息"                               | "2881.TW", "2882.TW", "0056.TW", "1101.TW"                |
| **Beginner_Attention_Index**| "股票是什麼", "怎麼投資", "證券開戶", "股市新手", "股票入門"             | "9917.TW", "2603.TW", "2884.TW"                           |
| **Macro_Attention_Index**   | "升息", "通膨", "美國股市", "FED", "經濟衰退"                             | "1301.TW", "2308.TW"                                     |
| **Tech_Attention_Index**    | "半導體", "台積電", "AI 投資", "高科技股", "IC 設計"                      | "3034.TW", "2454.TW"                                     |

This logic ensures that our volume-based market signals are well-aligned with the **public attention captured in search behavior**, providing a meaningful basis for correlation and predictive analysis.
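
For later per-theme analysis it can help to keep the mapping from the table in a single data structure. The dictionary below copies the table's ticker assignments. The names `theme_tickers` and `volume_columns` are hypothetical helpers, and the `_Volume_norm` suffix assumes the column naming used in the download cell of this notebook.

```python
# Ticker assignments copied from the mapping table above
theme_tickers = {
    "ETF_Attention_Index": ["0050.TW", "006208.TW", "00878.TW", "00713.TW"],
    "Stock_Attention_Index": ["2330.TW", "2303.TW", "2412.TW", "3008.TW"],
    "Dividend_Attention_Index": ["2881.TW", "2882.TW", "0056.TW", "1101.TW"],
    "Beginner_Attention_Index": ["9917.TW", "2603.TW", "2884.TW"],
    "Macro_Attention_Index": ["1301.TW", "2308.TW"],
    "Tech_Attention_Index": ["3034.TW", "2454.TW"],
}

def volume_columns(index_name: str) -> list[str]:
    """Names of the normalized-volume columns that belong to one attention index."""
    return [f"{ticker}_Volume_norm" for ticker in theme_tickers[index_name]]

# Flatten the mapping to recover the full stock universe
all_tickers = [t for group in theme_tickers.values() for t in group]
print(len(all_tickers), volume_columns("Tech_Attention_Index"))
```

With the merged panel in hand, an expression such as `merged_df[volume_columns("ETF_Attention_Index")].mean(axis=1)` would give one average volume series per theme to compare against the matching index.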

In [3]:
# Define the tickers to download, grouped by attention theme (same grouping as the table above)
tickers = [
    '0050.TW', '006208.TW', '00878.TW', '00713.TW',   # ETF-related
    '2330.TW', '2303.TW', '2412.TW', '3008.TW',       # Stock-following
    '2881.TW', '2882.TW', '0056.TW', '1101.TW',       # Dividend
    '9917.TW', '2603.TW', '2884.TW',                  # Beginner-friendly
    '1301.TW', '2308.TW',                             # Macro-sensitive
    '3034.TW', '2454.TW'                              # Tech-specific
]

start_date = '2024-01-01'
end_date = '2025-01-01'

# Download daily data (auto_adjust is passed explicitly to match the current yfinance default)
prices = yf.download(tickers, start=start_date, end=end_date, group_by='ticker', auto_adjust=True)

# Sum daily volume into weeks ending on Sunday, then z-score each ticker
volume_dfs = []
for ticker in tickers:
    vol = prices[ticker]['Volume'].resample('W-SUN').sum()
    vol_norm = (vol - vol.mean()) / vol.std()
    volume_dfs.append(vol_norm.rename(f"{ticker}_Volume_norm"))

# Combine all volumes
volume_df = pd.concat(volume_dfs, axis=1)

# Merge with attention index
merged_df = pd.merge(volume_df, attention_index_df, left_index=True, right_index=True, how='inner')

# Preview merged data
merged_df.head()

| Date | 0050.TW_Volume_norm | 006208.TW_Volume_norm | 00878.TW_Volume_norm | 00713.TW_Volume_norm | 2330.TW_Volume_norm | 2303.TW_Volume_norm | 2412.TW_Volume_norm | 3008.TW_Volume_norm | 2881.TW_Volume_norm | 2882.TW_Volume_norm | ... | 1301.TW_Volume_norm | 2308.TW_Volume_norm | 3034.TW_Volume_norm | 2454.TW_Volume_norm | ETF_Attention_Index | Stock_Attention_Index | Dividend_Attention_Index | Beginner_Attention_Index | Macro_Attention_Index | Tech_Attention_Index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-01-07 | -0.860674 | -0.932212 | -0.379417 | -0.874601 | -1.254108 | -0.166771 | -0.911028 | 0.418549 | -1.167018 | -1.203771 | ... | -1.403838 | -1.300789 | -0.4665 | 0.439787 | -1.21605 | -1.367152 | -0.199133 | -0.899066 | 0.390401 | -0.575975 |
| 2024-01-14 | -0.653673 | -0.93129 | -0.708484 | -0.822152 | -1.29524 | -0.650347 | -0.810122 | 1.288575 | -0.906104 | -0.99804 | ... | -0.842457 | -0.264088 | -0.352294 | -0.322731 | -0.116791 | -0.582552 | 0.236205 | -0.666854 | -0.072779 | -0.180896 |
| 2024-01-21 | 1.345122 | -0.47863 | -0.28 | -0.611415 | 1.391158 | 0.676357 | -0.147496 | 0.498769 | 0.186507 | -0.045758 | ... | 0.223543 | 0.332889 | 1.014547 | 0.856694 | -0.433229 | -0.582552 | -0.121921 | -0.173195 | -0.007083 | -0.180896 |
| 2024-01-28 | 0.804586 | -0.368364 | -0.302921 | -0.923418 | 0.623496 | 1.909554 | -0.861677 | -0.341718 | -0.874992 | -1.063764 | ... | -0.901903 | -0.248834 | -0.524553 | 0.512786 | -0.119907 | -1.067593 | -0.194599 | -0.56343 | -0.396802 | -0.685401 |
| 2024-02-04 | 0.260986 | -0.821956 | -0.410453 | -1.048354 | -0.096706 | -0.126591 | -0.394153 | 0.253704 | -0.933695 | -0.99709 | ... | -0.971914 | -0.339969 | -0.359664 | 1.502433 | -1.38365 | -2.106888 | -1.136914 | -0.922065 | -0.981105 | -1.908722 |


In [4]:
merged_df.to_excel('merged_df.xlsx')