A Python utility that converts Twitter archive data (in JavaScript format) to CSV for easier analysis and processing.
This application reads a Twitter archive file (`tweets.js`) and converts it into a structured CSV format with enhanced analytical features. It extracts key information from each tweet, including:
- Tweet ID
- Creation date
- Tweet text
- User name
- User screen name
- Retweet count
- Favorite count
- Hashtags
- User mentions
- URLs
The converter now includes additional columns specifically designed to help AI analysis:
- Tweet type (reply, retweet, or original)
- Engagement rate (calculated from retweets and favorites)
- Time-based features:
  - Hour of day
  - Day of week
  - Month
  - Year
  - Weekend indicator
- Content analysis:
  - Tweet length
  - Question presence
  - Exclamation presence
  - Emoji usage
  - Word count
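To make the derived columns concrete, here is a minimal sketch of how they could be computed from a single tweet record. This is an illustration, not the script's exact code: the field names follow the archive format shown later, the emoji check is a simplified regex, and the script's actual percentage formula for engagement rate is not shown in this README, so only the raw retweet-plus-favorite sum is computed here.

```python
import re
from datetime import datetime

def extract_features(tweet):
    """Illustrative sketch of the derived analytical columns."""
    text = tweet.get("full_text") or tweet.get("text", "")
    # Archive timestamps look like "Thu Sep 26 22:39:58 +0000 2024"
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    retweets = int(tweet.get("retweet_count", 0))
    favorites = int(tweet.get("favorite_count", 0))

    # Classify the tweet: retweets start with "RT @", replies carry a reply-to id
    if text.startswith("RT @"):
        tweet_type = "retweet"
    elif tweet.get("in_reply_to_status_id_str"):
        tweet_type = "reply"
    else:
        tweet_type = "original"

    return {
        "tweet_type": tweet_type,
        # The script normalizes this sum into a percentage; formula not shown here
        "engagement": retweets + favorites,
        "hour_of_day": created.hour,
        "day_of_week": created.strftime("%A"),
        "month": created.strftime("%B"),
        "year": created.year,
        "is_weekend": created.weekday() >= 5,
        "tweet_length": len(text),
        "has_question": "?" in text,
        "has_exclamation": "!" in text,
        # Simplified emoji detection over common emoji/symbol code-point ranges
        "has_emoji": bool(re.search(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", text)),
        "word_count": len(text.split()),
    }
```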
Requirements:
- Python 3.x
- Standard Python libraries:
  - json
  - csv
  - argparse
  - pathlib
  - datetime
  - re
Basic usage:

```shell
python Tweet2CSV.py
```

Advanced usage with command-line arguments:

```shell
python Tweet2CSV.py --input /path/to/tweets.js --output output.csv --encoding utf-8
```
- `--input`, `-i`: Path to the input tweets.js file (default: /users/keithtownsend/downloads/twitter/data/tweets.js)
- `--output`, `-o`: Path to the output CSV file (default: tweets.csv)
- `--encoding`, `-e`: File encoding (default: utf-8)
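These options map naturally onto Python's `argparse` module (listed under the requirements above). A minimal sketch of how the flags might be wired up; the real script may differ in details:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Sketch of the CLI described above."""
    parser = argparse.ArgumentParser(
        description="Convert a Twitter archive tweets.js file to CSV")
    parser.add_argument(
        "--input", "-i", type=Path,
        default=Path("/users/keithtownsend/downloads/twitter/data/tweets.js"),
        help="Path to the input tweets.js file")
    parser.add_argument(
        "--output", "-o", type=Path, default=Path("tweets.csv"),
        help="Path to the output CSV file")
    parser.add_argument(
        "--encoding", "-e", default="utf-8",
        help="File encoding")
    return parser.parse_args(argv)
```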
The script expects a Twitter archive file (`tweets.js`) in the following format:
```javascript
window.YTD.tweets.part0 = [
  {
    "tweet" : {
      "edit_info" : {
        "initial" : {
          "editTweetIds" : [
            "1839419668525961279"
          ],
          "editableUntil" : "2024-09-26T22:39:58.000Z",
          "editsRemaining" : "5",
          "isEditEligible" : false
        }
      },
      "retweeted" : false,
      "source" : "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>",
      "entities" : {
        "hashtags" : [...],
        "user_mentions" : [...],
        "urls" : [...]
      },
      "id_str" : "...",
      "created_at" : "...",
      "text" : "...",
      "user" : {
        "name" : "...",
        "screen_name" : "..."
      },
      "retweet_count" : 0,
      "favorite_count" : 0
    }
  },
  ...
]
```
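Because the file is a JavaScript assignment rather than pure JSON, the loader has to strip the `window.YTD.tweets.part0 = ` prefix before parsing. A sketch of that approach (the function name `load_tweets` matches the one mentioned in the customization section, but the body here is illustrative):

```python
import json

def load_tweets(path, encoding="utf-8"):
    """Parse a tweets.js archive file into a list of tweet dicts."""
    with open(path, encoding=encoding) as f:
        raw = f.read()
    # Everything before the first '[' is the JavaScript assignment wrapper
    json_text = raw[raw.index("["):]
    entries = json.loads(json_text)
    # Each array entry wraps the actual tweet object in a "tweet" key
    return [entry["tweet"] for entry in entries]
```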
The script generates a CSV file with the following columns:
- id
- created_at
- text
- user_name
- user_screen_name
- retweet_count
- favorite_count
- hashtags (semicolon-separated)
- mentions (semicolon-separated)
- urls (semicolon-separated)
- tweet_type (reply/retweet/original)
- engagement_rate (percentage)
- hour_of_day (0-23)
- day_of_week (Monday-Sunday)
- month (January-December)
- year
- is_weekend (true/false)
- tweet_length (character count)
- has_question (true/false)
- has_exclamation (true/false)
- has_emoji (true/false)
- word_count
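The columns above are what the output stage writes. A minimal sketch of that stage using `csv.DictWriter`, assuming each tweet has already been flattened into a dict keyed by these column names (the header list mirrors the columns documented above; the exact code in the script may differ):

```python
import csv

CSV_HEADER = [
    "id", "created_at", "text", "user_name", "user_screen_name",
    "retweet_count", "favorite_count", "hashtags", "mentions", "urls",
    "tweet_type", "engagement_rate", "hour_of_day", "day_of_week",
    "month", "year", "is_weekend", "tweet_length", "has_question",
    "has_exclamation", "has_emoji", "word_count",
]

def write_csv(rows, path, encoding="utf-8"):
    """Write flattened tweet dicts to CSV with the documented column order."""
    with open(path, "w", newline="", encoding=encoding) as f:
        # Missing keys become empty cells; unexpected keys are ignored
        writer = csv.DictWriter(f, fieldnames=CSV_HEADER, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```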
A comprehensive data dictionary is provided in `DATA_DICTIONARY.md` that explains:
- Each field in the CSV file
- How to interpret the values
- Common analysis scenarios
- Industry-specific metrics
- Best practices for data analysis
The data dictionary is designed to help AI tools like ChatGPT better understand and analyze your tweet data.
- Rainbow CSV (VS Code Extension)
  - Color-codes CSV columns for better readability
  - Validates CSV formatting
  - Provides SQL-like querying capabilities
  - Makes it easier to spot patterns in your data
  - Installation: search for "Rainbow CSV" in VS Code extensions
- Excel/Google Sheets
  - Familiar spreadsheet interface
  - Built-in filtering and sorting
  - Pivot tables for data aggregation
  - Charts and visualizations
  - Good for sharing with team members
- Pandas (Python Library)
  - Powerful data analysis capabilities
  - Can handle large datasets efficiently
  - Extensive statistical functions
  - Integration with visualization libraries
  - Example usage:

    ```python
    import pandas as pd

    df = pd.read_csv('tweets.csv')

    # Analyze engagement by day of week
    print(df.groupby('day_of_week')['engagement_rate'].mean())
    ```
- ChatGPT
  - Upload the CSV file and data dictionary
  - Ask specific analysis questions
  - Get insights and recommendations
  - Example prompt: "Analyze my tweet data and tell me which topics get the most engagement"
- Claude
  - Similar capabilities to ChatGPT
  - Often better at handling structured data
  - Can provide more detailed analysis
- Custom AI Analysis Scripts
  - Create Python scripts using libraries like scikit-learn
  - Build predictive models for engagement
  - Generate automated reports
The script includes comprehensive error handling for:
- File reading errors
- JSON parsing errors
- CSV writing errors
- Invalid file paths
- Encoding issues
- Date parsing errors
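The general pattern behind that error handling is to catch each failure class and exit with a clear message instead of a raw traceback. A hedged sketch of what the file-reading and JSON-parsing cases could look like (illustrative, not the script's exact code):

```python
import json
import sys

def safe_load(path, encoding="utf-8"):
    """Load a tweets.js file, exiting with a readable message on failure."""
    try:
        with open(path, encoding=encoding) as f:
            raw = f.read()
        # Strip the JavaScript assignment prefix before parsing
        return json.loads(raw[raw.index("["):])
    except FileNotFoundError:
        sys.exit(f"Input file not found: {path}")
    except UnicodeDecodeError:
        sys.exit(f"Could not decode {path} with encoding {encoding!r}")
    except ValueError as exc:
        # Covers both a missing '[' and json.JSONDecodeError
        sys.exit(f"Could not parse tweet JSON in {path}: {exc}")
```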
You can customize the script for your specific needs:
- Adding New Fields:
  - Edit the `csv_header` list in the `write_csv` function
  - Add corresponding data extraction in the row creation section
- Changing Analysis Logic:
  - Modify the analysis functions (`classify_tweet_type`, `calculate_engagement_rate`, etc.)
  - Add new analysis functions as needed
- Adjusting Tweet Structure Parsing:
  - Update the `load_tweets` function if your tweet archive has a different structure
  - Modify the tweet data extraction in the `write_csv` function
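As a concrete example of adding a new field, here is a hypothetical `has_link` column. The column name and the abbreviated header are made up for illustration; the real `csv_header` list and row-creation code live in the script:

```python
# Hypothetical illustration of "Adding New Fields": a has_link column.
CSV_HEADER = ["id", "text"]     # existing columns (abbreviated for the example)
CSV_HEADER.append("has_link")   # step 1: extend the header list

def build_row(tweet):
    """Build one CSV row from a tweet dict, including the new field."""
    row = {
        "id": tweet.get("id_str", ""),
        "text": tweet.get("text", ""),
    }
    # step 2: add the corresponding extraction in the row-creation section
    row["has_link"] = "http" in row["text"].lower()
    return row
```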
The script can be adapted for various use cases:
- Personal Brand Analysis:
  - Focus on engagement metrics and content analysis
  - Use the data dictionary's "Thought Leadership Impact" scenarios
- Business Marketing:
  - Add fields for campaign tracking
  - Focus on conversion metrics and audience analysis
- Content Creator Analysis:
  - Add fields for content categories
  - Focus on content performance across different topics
- Community Management:
  - Add fields for community engagement metrics
  - Focus on interaction patterns and response effectiveness
- Technical Content Analysis:
  - Use the tech industry-specific scenarios in the data dictionary
  - Focus on technical topic performance and educational content
- Make sure you have the necessary permissions to read the input file and write to the output directory
- The script will overwrite any existing output CSV file
- For large tweet archives, the conversion process might take some time
- The script supports UTF-8 encoding by default, but you can specify a different encoding if needed
- The enhanced analytical features are designed to help AI tools like ChatGPT better analyze your tweet performance patterns