
Final Report Submission


BIG DATA PROGRAMMING - FINAL REPORT

PROJECT TITLE AND TEAM MEMBERS

International Student Data Analysis using Big Data.

  • Tanvi Jain

  • Saikumar Reddy Papagari

  • Amarnadha Reddy Ankireddypalli

  • Thotakura Naga Mounika

INTRODUCTION

This project focuses on extracting a dataset of tweets about international students from the Twitter API and using tools such as Hive, Hue, Scala, PySpark, Spark SQL, and Solr to examine different aspects of the data, with visualizations built in Tableau and Seaborn.

GitHub Link: https://github.com/xlr8r53/BDP-Project-Team-6/wiki/Final-Report-Submission

PPT Link: https://github.com/xlr8r53/BDP-Project-Team-6/blob/fe992332219ce41cb6877a771786d0d9fb77d02e/Final_Presentation.pptx

Video Link: https://umkc.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=1367520c-3de7-4ecb-825f-ad25001fd55c

BACKGROUND

Twitter sentiment analysis has been done in many ways before; using pandas is the most common approach. We use hashtags related to international students and to events affecting them, so that the results reflect what actually took place and the different aspects those events bring out. Earlier projects that implement Twitter analysis with big data served as references; we follow some aspects of them, but our dataset is new and was extracted specifically for this project, so the available references use entirely different datasets and have entirely different objectives.

GOALS AND OBJECTIVES

Motivation:

Studying abroad is a journey of education and discovery. There are currently over 1 million international students, from more than 220 countries, coming to the United States annually. We belong to this international student community and came to the United States for higher education. There are numerous situations where we are expected to follow rules that US citizens are not, simply because we belong to the immigrant community. This gave us the foundation for this project, and we chose to highlight key information such as the percentage of students coming to the United States for education, the probability of getting a work visa, the immigration rule changes for the F-1 visa during COVID, and jobs for international students during COVID (H1B sponsorship).

Significance:

Big data tools make it possible to analyze large volumes of data efficiently. The sentiment analysis gives a concise picture of the various challenges international students are currently facing and of the impact of COVID-19 on visa assurance. We write the queries in Spark using Python and Scala and visualize the results with Seaborn and Tableau.

Objectives:

The objective of this project is to extract Twitter data from the Twitter API, clean it, perform sentiment analysis, and import the files into Hadoop, where we use tools such as Hue, Hive, Solr, Scala, Spark SQL, and PySpark.

Features:

The main features of the project are collecting real-time tweets from the Twitter API and performing ETL: we preprocess the data using Texthero, extract the necessary fields, and load the extracted data into Hive. We performed topic modelling using LDA with the gensim library, visualized the top 4 topics with t-SNE, and analyzed the data using PySpark, Solr, and Scala.
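A minimal sketch of the LDA topic-modelling step with gensim, assuming the cleaned tweet texts are available as a list of strings named clean_texts (a hypothetical name):

```python
# Sketch only: clean_texts is a hypothetical list of preprocessed tweet strings.
from gensim import corpora
from gensim.models import LdaModel

tokenized = [doc.split() for doc in clean_texts]            # simple whitespace tokenization
dictionary = corpora.Dictionary(tokenized)                  # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in tokenized]     # bag-of-words corpus

# Train LDA and inspect the top 4 topics (the number we focused on).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, passes=10, random_state=42)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```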

DATASET:

We collected the data from the Twitter API using a developer account and API keys. We used different hashtags to get data for three different scenarios:

  1. Generic International Student Data (#F1visa, #intlstudents, #internationalstudents, #studyinUSA)

  2. Immigration rules change for F1 Visa during COVID in 2020 (#AbolishICE)

  3. Jobs for international students during COVID (H1B sponsorship) (#H1B, #h1bjobs)

The extraction gives us information about the tweet itself as well as about the user. It contains all the different fields, which we filter and use later with big data tools to perform queries and visualization. Features and their description:

Table: Description of Attributes of the Data

Detailed design of features and project workflow:


DATA ANALYSIS:

We extracted datasets using the Twitter API with different hashtags, each focused on a different event related to international students. Dividing the events and datasets is crucial because we want this project to be informational and to cover different aspects through different datasets. We apply sentiment analysis to the tweet text and use queries to extract important features, which we visualize in Tableau.

Data Collection:

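A minimal sketch of the collection step using Tweepy; the API keys, output file name, and field choices are placeholders, and the search call name depends on the Tweepy version (api.search in 3.x, api.search_tweets in 4.x):

```python
import csv
import tweepy

# Placeholder credentials from the Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

query = "#F1visa OR #intlstudents OR #internationalstudents OR #studyinUSA"
with open("tweets_generic.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["tweet_id", "created_at", "user_name", "followers_count", "lang", "text"])
    for tweet in tweepy.Cursor(api.search_tweets, q=query, tweet_mode="extended").items(500):
        writer.writerow([tweet.id_str, tweet.created_at, tweet.user.screen_name,
                         tweet.user.followers_count, tweet.lang, tweet.full_text])
```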

Data Preprocessing:

a. (screenshot)

b. Fig: Plotting the top words from the data.

c. Fig: Adding PCA and k-means clusters to the preprocessed data.

d. Fig: Scatter plot of the k-means and PCA clustered data.
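A minimal sketch of this preprocessing with Texthero, assuming the raw tweets sit in a pandas DataFrame df with a "Text" column (column name assumed):

```python
import texthero as hero

# Clean the tweets (lowercasing, punctuation/stopword removal, etc.).
df["clean_text"] = hero.clean(df["Text"])

# TF-IDF representation, 2-D PCA projection, and k-means cluster labels.
df["tfidf"] = hero.tfidf(df["clean_text"])
df["pca"] = hero.pca(df["tfidf"])
df["kmeans"] = hero.kmeans(df["tfidf"], n_clusters=4)

# Top-words bar chart and the PCA/k-means scatter plot shown above.
hero.top_words(df["clean_text"]).head(20).plot.bar()
hero.scatterplot(df, col="pca", color="kmeans", title="PCA + k-means clusters")
```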

IMPLEMENTATION:

Solr:

1. Create/generate the instance and collection.

2. Edit the schema.xml generated with the instance, inside the configuration folder, to change the attributes to match the dataset.

3. Open Solr in the web browser and select "create collection" from the left-side dropdown.

4. Set "Tweet Id" as the primary key.

5. Set the document type to CSV, then copy and paste the dataset into the documents field and submit the document.

Queries

1. Pulled the first 10 records of data.

2. Pulled the tweets written in English.

3. Pulled the records that have 'F1Visa' in the Text field (regex).

4. Collected the response matching Text:"student" AND Text:"USA".

5. Pulled the most retweeted tweets.
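A minimal sketch of these Solr queries issued over HTTP with the requests library; the host, port, collection name ("tweets"), and field names (Language, Text, Retweet_Count) are assumptions based on the schema above:

```python
import requests

SOLR = "http://localhost:8983/solr/tweets/select"   # assumed host and collection

def solr(q, **params):
    """Run a query against Solr's select handler and return the matching docs."""
    params.update({"q": q, "wt": "json"})
    return requests.get(SOLR, params=params).json()["response"]["docs"]

print(solr("*:*", rows=10))                              # 1. first 10 records
print(solr("Language:en", rows=10))                      # 2. tweets in English
print(solr("Text:*F1Visa*", rows=10))                    # 3. 'F1Visa' in the text
print(solr('Text:"student" AND Text:"USA"', rows=10))    # 4. both terms present
print(solr("*:*", sort="Retweet_Count desc", rows=5))    # 5. most retweeted tweets
```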

Converting the working dataset from .csv to a .json file:
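A minimal sketch of that conversion using pandas; the input file name is a placeholder, and the output matches the dat1.json file loaded in the Scala section:

```python
import pandas as pd

df = pd.read_csv("tweets_combined.csv")      # placeholder input file
# One JSON object per line, the layout sqlContext.read.json() expects.
df.to_json("dat1.json", orient="records", lines=True)
```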

Place the .json file in the Cloudera working directory.

Spark using Scala

1. Start Scala with the spark-shell command.

2. Load 'dat1.json' into a data frame using sqlContext.read.json().

3. Print the schema of the dataset.

4. Display the username and text of the top 20 tweets.

5. Count the number of media types present in the dataset.

6. Fetch the different languages in which tweets about international students are written.

7. Fetch the users with the most hashtags in their tweets about student data.

8. Fetch the users with the most followers who tweet about student data.

9. Fetch the users with the most mentions in their tweets about student data.

10. Fetch the account names with the most tweets about student data.

11. Fetch the users whose tweets about student data have the most retweets.

Spark using Python

We also used Python because, for our workload, it executed faster than the Scala data frames; visualization is achieved with Matplotlib and Seaborn in Python.

  1. Download Java to run the Java Virtual Machine (JVM).

  2. Set the environment path. This enables us to run PySpark in the Colab environment.

  3. Import SparkSession from pyspark.sql and create a SparkSession.
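A minimal sketch of these setup steps in a Colab notebook; the Java package name and install path are assumptions and vary between Colab images:

```python
# Install Java for the JVM and PySpark itself (Colab shell commands).
!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null
!pip install -q pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # assumed install path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("IntlStudentTweets").getOrCreate()
```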

1. Import the dataset and create data frames directly on import.


2. Check for duplicate records and null values in the dataset.

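A minimal sketch of steps 1 and 2, assuming the combined export is a CSV file (file name hypothetical):

```python
from pyspark.sql import functions as F

df = spark.read.csv("tweets_combined.csv", header=True, inferSchema=True)

# Duplicate check: compare total rows with distinct rows.
print("rows:", df.count(), "distinct rows:", df.dropDuplicates().count())

# Null check: count nulls per column.
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()
```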

Queries:

1. Display the first 100 rows ordered by the date tweeted.

2. Count the number of tweets from each data source by performing a union operation.

3. Count the tweets grouped by language, ordered in descending order, and pull the top 5 (a code sketch for this and the next query appears after this list).

4. Pull the top 10 user_name values by followers_count after dropping duplicates.

5. Pull the count of tweets based on the desired topics from the Text attribute.

6. Display the screen names of the most retweeted tweets.

7. Display the screen names of the tweets with the most mentions.

8. Display the screen names of the tweets with the most hashtags.

9. Plot the users with the highest number of media URLs.

10. Display the count of different types of tweets (retweet, reply, or tweet).

11. Analyzing the user data with PySpark.

12. Check the number of verified and unverified accounts that tweeted about the international student cause.

13. This visualization shows that there were more tweets in 2020 than in other years, since international students were in the news that year because of the corona pandemic, when ICE decided to change the immigration rules and H1B sponsorship was affected by the lack of jobs and the recession.

14. The countries that tweeted the most and the least about the international student situation in the United States.
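A minimal sketch of queries 3 and 4 above; the column names "Language", "user_name", and "followers_count" are assumptions based on the screenshots:

```python
from pyspark.sql import functions as F

# Query 3: tweet counts per language, descending, top 5.
(df.groupBy("Language")
   .count()
   .orderBy(F.desc("count"))
   .show(5))

# Query 4: top 10 users by follower count after dropping duplicate user names.
(df.dropDuplicates(["user_name"])
   .orderBy(F.desc("followers_count"))
   .select("user_name", "followers_count")
   .show(10))
```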

PySpark, Spark SQL, NLP, and SQL queries for both users and tweets:

There are two datasets: one covers information about the tweets, and the other covers information about the users.

Changing Spark SQL DataFrame columns from one data type to another to make the analysis more accurate and meaningful.

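A minimal sketch of the type cast and a Spark SQL query over the users data; the column and view names are assumptions based on the screenshots:

```python
from pyspark.sql import functions as F

# Cast numeric columns that were read in as strings.
users_df = (users_df
            .withColumn("followers_count", F.col("followers_count").cast("int"))
            .withColumn("friends_count", F.col("friends_count").cast("int")))

# Register a temporary view so the data can be queried with Spark SQL.
users_df.createOrReplaceTempView("users")
spark.sql("""
    SELECT user_name, followers_count, friends_count, favourites_count
    FROM users
    WHERE user_verified = true
    ORDER BY followers_count DESC
    LIMIT 10
""").show()
```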

Spark SQL Queries


The total number of verified users who tweeted about international students, their followers, the accounts they follow, and their favourites.


Tweets Data


Sentiment analysis trend over time for the year 2020:

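A minimal sketch of how such a trend can be computed, assuming TextBlob for polarity and "Date" and "Text" columns (names assumed):

```python
from textblob import TextBlob
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# UDF returning TextBlob polarity in [-1, 1] for each tweet.
polarity = F.udf(lambda t: TextBlob(t).sentiment.polarity if t else 0.0, DoubleType())

trend = (df.withColumn("polarity", polarity(F.col("Text")))
           .withColumn("month", F.date_format(F.to_timestamp("Date"), "yyyy-MM"))
           .groupBy("month")
           .agg(F.avg("polarity").alias("avg_polarity"))
           .orderBy("month"))

# Plot the monthly average sentiment for 2020.
trend.filter(F.col("month").startswith("2020")).toPandas().plot(x="month", y="avg_polarity")
```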

Visualization of the trend of tweets by verified and non-verified users across different years regarding international students and the ICE immigration rule changes; it also includes the tweets about H1B sponsorship.


HIVE:

Created the database and table in Hive and loaded the AbolishICE dataset into the Hive table.
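A minimal sketch of the kind of Hive DDL behind these steps, issued here through PySpark with Hive support; the database, table, column, and path names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveLoad").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS tweets_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS tweets_db.abolishice (
        tweet_id STRING, created_at STRING, user_name STRING,
        followers_count INT, language STRING, text STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
spark.sql("""
    LOAD DATA INPATH '/user/cloudera/abolishice.csv'
    INTO TABLE tweets_db.abolishice
""")
spark.sql("SELECT COUNT(*) FROM tweets_db.abolishice").show()
```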

Created the database and table in Hive, loaded the F1visa dataset into the Hive table, and visualized the output.

Created the database and table in Hive, loaded the h1b dataset into the Hive table, and visualized the output.

Created the database and table in Hive, loaded the intlstudents dataset into the Hive table, and visualized the output.

Created the database and table in Hive, loaded the studyinUSA dataset into the Hive table, and visualized the output.

Imported the file into HDFS.

Created the database and table in Hive, loaded the general dataset into the Hive table, and visualized the output.

Queries:

1. View the number of tweets about h1b.

2. View the number of tweets about F1visa.

3. Details of the tweets about international students with the highest favourites count.

4. Details of the tweets about international students using ORDER BY.

5. View the details of tweet IDs from the h1b dataset.

6. Details of retweets, with a LIMIT, from the h1b dataset.

RESULTS EVALUATION:

a-d. (screenshots)

e. Fig: Plotting top words from the Text data.

f. Fig: Word cloud generation using Texthero.

g. Fig: Scatter plot of the PCA and k-means cluster analysis.

CONCLUSION:

We performed the data extraction using the Twitter API for different hashtags and implemented the analysis in various big data tools such as Hive, Spark, and PySpark. The visualization is done using PySpark in the form of bar graphs, pie charts, and scatter plots. The sentiment analysis of the tweets is performed using PySpark, and further visualization is done with tools such as t-SNE and Tableau.

FUTURE WORK:

  • Implementing better machine learning algorithms.
  • Working with more data sources such as Instagram, Facebook, and YouTube.
  • Considering deployment of the project in Docker.
  • Making better use of the big data tools by targeting a specific, predefined, and focused dataset.
  • Building a tool, based on ML models and streaming tweets, that can report the sentiment around a specific situation or crisis and offer suggestions, helping people stay calm and productive instead of panicking.

PROJECT MANAGEMENT:

Description:

All the tasks that were lined up for the final project have been completed successfully in a timely fashion. We were able to implement different big data tools to analyse the data about international students and how they were impacted by the COVID situation and crisis.

Work Completed:

We completed the extraction of the dataset from Twitter using the Twitter API, the preprocessing of the tweets, and the sentiment analysis of those tweets. The extracted data was converted from CSV to JSON format. We also worked with Hive, Scala, and PySpark to apply different analysis techniques, loaded the dataset into Solr to analyze various aspects of the data, and visualized the results using Tableau and Seaborn.

Responsibility (Task, Person):

  • Twitter data streaming (Tanvi, Mounika)
  • Data processing (Tanvi, Amar)
  • Conversion of csv to json (Saikumar)
  • Spark streaming using Scala (Saikumar)
  • Solr execution and queries (Amar)
  • pySpark queries and plotting (Amar, Tanvi)
  • Word cloud, PCA and k-means clustering using Texthero (Amar)
  • Hive implementation (Mounika)
  • Visualization in Seaborn (Amar)
  • Word Cloud using pyspark (Tanvi)
  • Topic modelling to find dominant terms (Tanvi)
  • TSNE clustering & Visualization in Tableau (Tanvi)
  • Sentiment analysis and trend analysis in pySpark (Tanvi)

Contributions (members/percentage):

  • Tanvi Jain (25%)
  • Amarnadha Reddy Ankireddypalli (25%)
  • Naga Mounika Thotakura (25%)
  • Saikumar Reddy Papagari (25%)

Issues/Concerns:

  • The data we gathered from Twitter in the beginning is unprocessed. So, we faced challenges in combining all the extracted .csv files into one single file.
  • Faced challenges in converting the .csv file to a .json file because of the size of the data.

STORY TELLING:

WHO WHAT WHEN WHERE WHY HOW

CHAPTER 1

  1. The international students who are studying in the United States of America (or overseas)

  2. The problem mainly affected them in terms of finances, job hunting, the ability to work for any company at any time (or being rejected by companies because of limited work authorization), and the stress that comes from a lack of stability in life.

  3. There is no specific time or place for the problem; these are the general situational challenges that international students come across.

  4. These challenges can be faced by international students in a foreign land, where they are part of the immigrant community and do not have access to the perks that citizens of that country get, which makes even the very basic requirements of living a bit harder.

  5. Due to a set of immigration rules and the limited availability of jobs and work authorization in desired fields.

  6. There is no certain time when events like these happen; these kinds of incidents are seen and experienced by many international students over time.

CHAPTER 2

  1. The international students who are studying in the United States of America (or overseas). The dataset is extracted using different hashtags to get data related to three different scenarios: generic international student data (#F1visa, #intlstudents, #internationalstudents, #studyinUSA); immigration rule changes for the F1 visa during COVID in 2020 (#AbolishICE); and jobs for international students during COVID (H1B sponsorship) (#H1B, #h1bjobs).

  2. Yes, the dataset records the targeted events, activities, behaviors, etc. from Assignment 1. This is fundamentally about the variables: it records the username, the location, and the tweets, which tell us what the users really think about the specific event that happened.

  3. The events concern how people reacted to the challenges faced by international students during COVID, such as the immigration rules for the F1 visa and jobs for international students during COVID (H1B sponsorship).

  4. These challenges can be faced by any international student in a foreign land, where they are part of the immigrant community and do not have access to the perks that citizens of that country get, which makes even the very basic requirements of living a bit harder.

  5. Due to the increase in the intake of international students over time.

  6. It happened during COVID; as a result there was a massive unavailability of jobs for new graduates, many people lost their jobs because of the pandemic, and the immigration rules for the F1 visa changed during COVID.

CHAPTER 3 (Scientist and AI)

  • Students from other countries studying in the United States of America or Overseas

  • The preprocessed data can be fitted to any of the ML and deep learning models. The text is used for finding the dominant topics through topic modelling and sentiment analysis. All the other features of the data about tweets and users are used to visualize the important statistics using PySpark.

  • The extracted data can be used to find the accuracy and runtime performance of the best-fitting ML model.

  • It was part of the Big Data Programming course offered at the University of Missouri-Kansas City.

  • Most of the unsupervised learning models work best for our extracted data.

  • The data scientist may use data in the way that the research requires or based on the tool. Nothing in this method is static; new data or data sources may be introduced as required for further review.

CHAPTER 4 (Users)

  • International students

  • The visualisation shows the analysis of Twitter data on international students, through which we analysed the various challenges they faced due to COVID.

  • This project can be used to understand the present crisis due to covid such as travel ban, student’s in-take, and visa issues.

  • This project is visualised using Seaborn and Tableau and can be deployed in Docker as well.

  • The visualised data can be used to comprehend the travel bans, college admissions, and visa issues due to covid crisis.

  • International students can use this project to plan their overseas study in a better way, considering all the challenges they may face in the course of their journey.

CHAPTER 5 (Society)

  • This project focuses on the community of international students/immigrants in the USA during the COVID crisis. The data scientists who worked on this project are themselves part of the international student community and thought it was necessary to address this situation and analyse it using big data tools. Data scientists: Tanvi Jain, Saikumar Reddy Papagari, Amarnadha Reddy Ankireddypalli, Thotakura Naga Mounika.

  • Yes, there is definitely a social and cultural impact from this project, since it helps identify how people were affected and how drastically the nature of the tweets changed in this situation.

  • There are no privacy concerns because the data analysed is publicly available on the social media platform, i.e. Twitter.

  • The social impact occurs continually, though the data and the timeline this project covers are specifically the years 2020 and 2021, compared with previous years before COVID hit the world.

  • The cultural impact takes place in the setting of the USA and the immigration rule changes that resulted from the recession.

  • It is important for upcoming international students and immigrants to know how trends and immigration rules change owing to changes in the world economy and infrastructure. They should have a clear view and idea of how their life can be affected, positively or negatively, as a result of a recession.

  • Especially for communities that make up a large share of society, ML should have predefined tools that can suggest and guide people based on society's sentiment toward a specific scenario, so that instead of getting false information from the wrong sources, ML can be used as a valid resource when such a situation occurs.

REFERENCES: