<img src='http://imgur.com/1ZcRyrc.png' style='float: left; margin: 20px; height: 55px'>


# DSI-SG-42 Capstone Project:
### Silent Echoes: From Hand Waves to Written Phrases



<div style="text-align: center;">
    <img src="https://kidscarehomehealth.com/wp-content/uploads/2022/08/KCHH-sign-language-hello.jpg" alt="ASL image of signed Hello" style="width: 65%; height: auto;">
</div>

# 1. Introduction 

Across the globe, millions of people use sign language as their primary mode of communication in daily life. Whether a person is deaf, hard of hearing, or unable to speak, they use it to communicate with friends and family and in the workplace. However, there is a communication gap between people who cannot communicate through spoken language and people who do not understand sign languages.

Over [430 million people, or more than 5% of the world's population](https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss), have disabling hearing loss. [Hearing loss](https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss) can be caused by a wide array of factors such as genetics, ear infections, some chronic diseases, and trauma to the ear or head. Around 34 million children are affected by hearing loss, and some of them have to attend special schools to learn how to communicate with the general population.



## 1.1 Problem Statement

Creating an effective communication system between a deaf individual who uses sign language and a non-signing person presents significant challenges. The initial goal of this project is to lay the foundation for such a system by developing a solution that can accurately recognize and translate a select set of key signs - currently three critical words - into written language.


**This project aims to demonstrate the feasibility of scaling this technology to enable more comprehensive interactions and foster understanding between signing and non-signing individuals.**


## 1.2 Aims

The aim of this project is to bridge the gap by using data science to help the general public to understand sign languages so that the *signer* can convey their message naturally and fluidly without losing any of the nuances in body language. 

## 1.3 Import and Cleaning of Dataframe

### [HandSpeak](https://www.handspeak.com/word/most-used/)

This website recommends lists of words for people who want to learn sign language, split into blocks of 100 words. We will focus on the first page, which covers the words used most frequently in everyday communication: pleasantries, question words, and other common everyday words. Because the website lists the words in alphabetical order rather than learning order, we will shortlist 10 words from the list to build the initial model.

*n.b. as the video files and extracted landmark .npy files are too big for GitHub, they can be found on my [Google Drive](https://drive.google.com/drive/folders/1FcOdltQ70QdpTyonlIWNVPSipTXkLZ6m?usp=sharing)*

In [39]:
# 10 Chosen words

chosen_words = ['eat', 'hello', 'name', 'no', 'please', 'sorry', 
                'thank you', 'want', 'what', 'where']

In [65]:
# import libraries
import os
import shutil
import datetime
import logging
import pandas as pd
from pytube import YouTube
from collections import defaultdict

# Include a logging file
log_dir = '../log_files'
os.makedirs(log_dir, exist_ok=True)  # create the folder if it does not exist

log_file = os.path.join(log_dir, 'video_download.log')

# Setup Logger
logger = logging.getLogger()  # Get the root logger
for handler in logger.handlers[:]:  # iterate over a copy so that every old handler is removed
    logger.removeHandler(handler)


logging.basicConfig(
    level=logging.INFO,
    format = '%(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  
    ]
)

# set display settings
%matplotlib inline
pd.set_option('display.width', 100000)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

## 1.4 Download Training Videos from Microsoft-ASL dataset

Using the pytube library, we can download a video from YouTube when a link to it is available. A download can fail if the video is unavailable, has been made private, or is age-restricted. As the videos we are looking for are general educational content showing how each word is signed, we expect few age-restricted errors, if any. Videos that are unavailable or have been made private cannot be downloaded at all.
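
The failure modes described above can be tallied after a download run. The following is a minimal sketch that assumes the error text uses the same wording pytube produced in this notebook's download log; other pytube versions may phrase it differently. `REASONS` and `classify_failure` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Message fragments copied from this notebook's download log (assumption:
# the installed pytube version words its errors the same way).
REASONS = {
    'is a private video': 'private',
    'is age restricted': 'age_restricted',
    'is unavailable': 'unavailable',
}


def classify_failure(message: str) -> str:
    """Map an error message to a coarse failure reason (hypothetical helper)."""
    for fragment, reason in REASONS.items():
        if fragment in message:
            return reason
    return 'other'  # anything else, e.g. a network timeout


# Sample messages in the style of the log; the last one is shortened.
sample_messages = [
    'SI_7UivPW_I is unavailable',
    'eXpXg4q-qEQ is a private video',
    "p8OYydc3WQM is age restricted, and can't be accessed without logging in.",
    '<urlopen error [WinError 10060] A connection attempt failed>',
]

failure_counts = Counter(classify_failure(m) for m in sample_messages)
print(failure_counts)
```

Counting by reason makes it easier to see whether a low download rate comes from removed videos, which cannot be recovered, or from network errors, which may succeed on a retry.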


---
### 1.4.1 Microsoft-ASL *Train* Dataset
---

In [3]:
# import Microsoft-ASL training json file
df = pd.read_json('../data/MS-ASL/MSASL_train.json')

print(df.head())

                org_text     clean_text  start_time  signer_id  signer  start  end                          file  label  height    fps  end_time                                          url           text                                                                               box  width  review
0  match [light-a-MATCH]          match         0.0          0       0      0   83           match light-a-MATCH    830     360  30.00     2.767  https://www.youtube.com/watch?v=C37R_Ix8-qs          match              [0.057544618844985004, 0.21637457609176602, 1.0, 0.7300844192504881]    640     NaN
1                   FAIL           fail         0.0          0      -1      0   74                          FAIL    542     360  25.00     2.960  https://www.youtube.com/watch?v=PIsUJl8BN_I           fail  [0.06577941775321901, 0.167171776294708, 0.9392627477645871, 0.9187960028648371]    480     NaN
2                  laugh          laugh         0.0          4      26      0   31  SignSchool

Looking at the first 5 rows, we can observe a few things:

- Several column headers seem repetitive.
    - `org_text` and `file` are similar, but `file` is a cleaner version.
    - The same can be observed for `clean_text` and `text`.
- `height` and `width` represent the dimensions of the video.
- `start_time` and `end_time` are the times, in seconds, at which the sign starts and ends.
- `start` and `end` are the frames at which the sign starts and ends.
- `signer_id` and `signer` identify the signer in the video.
- `url` is the link to the video, but some of the urls are missing the initial 'https://' part of a web address.
- `box` represents the bounding box from (x1,y1) to (x2,y2), normalized by the video's `height` and `width`.
- `fps` is the frame rate of the video.
- `label` is an integer that represents the gloss (word) of the video.
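
In the rows printed above, the frame columns and the time columns appear to be linked through `fps`: `end_time` matches `end / fps`. The sketch below checks this on the first two rows only; it has not been verified for the whole dataset, and `frame_to_seconds` is a hypothetical helper name.

```python
def frame_to_seconds(frame: int, fps: float) -> float:
    """Convert a frame index to a timestamp in seconds."""
    return frame / fps


# (end frame, fps, end_time) copied from the first two rows shown above
rows = [(83, 30.00, 2.767), (74, 25.00, 2.960)]

for end_frame, fps, end_time in rows:
    computed = round(frame_to_seconds(end_frame, fps), 3)
    print(f'end={end_frame}, fps={fps}: computed {computed}, listed {end_time}')
```

If the relation holds throughout, the frame columns alone are enough to cut each sign out of its video, which is why only `start`, `end`, and `fps` need to be kept.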

#### URL Cleaning

In [4]:
# some urls are missing the 'https://' scheme, so we add it where it is absent
# and store the result in a new 'cleaned_url' column
df['cleaned_url'] = [url if url.startswith('https://') else 'https://' + url
                     for url in df['url']]


print(df.head()) # inspect the dataframe to check if the header is added


                org_text     clean_text  start_time  signer_id  signer  start  end                          file  label  height    fps  end_time                                          url           text                                                                               box  width  review                                  cleaned_url
0  match [light-a-MATCH]          match         0.0          0       0      0   83           match light-a-MATCH    830     360  30.00     2.767  https://www.youtube.com/watch?v=C37R_Ix8-qs          match              [0.057544618844985004, 0.21637457609176602, 1.0, 0.7300844192504881]    640     NaN  https://www.youtube.com/watch?v=C37R_Ix8-qs
1                   FAIL           fail         0.0          0      -1      0   74                          FAIL    542     360  25.00     2.960  https://www.youtube.com/watch?v=PIsUJl8BN_I           fail  [0.06577941775321901, 0.167171776294708, 0.9392627477645871, 0.9187960028648371]    480     NaN  htt

When we download the videos from YouTube using pytube, we are expecting the name of the file to be `file` but with a suffix of '.mp4'. Here, we will add the suffix to all the files in the dataframe to help with the preprocessing.

#### Adding suffix to file name

In [5]:
# as the file names inside the dataframe do not have '.mp4' and have trailing whitespaces,
# we will clean the values in the 'file' column 

df = df.copy()
df['file'] = [file.strip() + '.mp4' for file in df['file']]

print(df.head()) # inspect the dataframe

                org_text     clean_text  start_time  signer_id  signer  start  end                              file  label  height    fps  end_time                                          url           text                                                                               box  width  review                                  cleaned_url
0  match [light-a-MATCH]          match         0.0          0       0      0   83           match light-a-MATCH.mp4    830     360  30.00     2.767  https://www.youtube.com/watch?v=C37R_Ix8-qs          match              [0.057544618844985004, 0.21637457609176602, 1.0, 0.7300844192504881]    640     NaN  https://www.youtube.com/watch?v=C37R_Ix8-qs
1                   FAIL           fail         0.0          0      -1      0   74                          FAIL.mp4    542     360  25.00     2.960  https://www.youtube.com/watch?v=PIsUJl8BN_I           fail  [0.06577941775321901, 0.167171776294708, 0.9392627477645871, 0.9187960028648371]    480 

As there are columns that we do not use and the columns are disorganized, we will create a subset of the main dataframe that keeps only the columns we require, in a tidier order. We will also sort the dataframe alphabetically by the `clean_text` column.

#### Create a cleaner and organized dataframe

In [6]:
# create a cleaner dataframe with the headers that will be needed

filtered_train_df = df[['clean_text',  'file', 'label', 'height', 'width', 'fps', 'start', 'end',  'cleaned_url']]

# Sort the DataFrame based on the 'clean_text' column
filtered_train_df_sorted = filtered_train_df.sort_values(by='clean_text')

print(filtered_train_df_sorted.head())

      clean_text                         file  label  height  width    fps  start   end                                         cleaned_url
11077       #apt  ASL 1 Unit 3 Vocabulary.mp4    656     360    640  29.97   8213  8332         https://www.youtube.com/watch?v=KRrKqGEGdMg
10733       #asl                   ASL(1).mp4    309     360    636  29.97      0    71         https://www.youtube.com/watch?v=UyvO5kp4ivQ
2908        #asl  ASL 1 Unit 2 Vocabulary.mp4    309     360    640  29.97   2701  2759         https://www.youtube.com/watch?v=f7COHRpmVKA
1909        #asl          CCHS ASL Unit 1.mp4    309     720    960  29.97    653   704  https://www.youtube.com/watch?v=eXpXg4q-qEQ&t=170s
1908        #asl          CCHS ASL Unit 1.mp4    309     720    960  29.97    423   513  https://www.youtube.com/watch?v=eXpXg4q-qEQ&t=170s


Before we initiate the download, we want to check how many entries there are for each signed word in the Microsoft-ASL train dataset.

In [24]:

# Initialize defaultdict for word counts
word_counts = defaultdict(int)

# Iterate over each row in the DataFrame
for text in filtered_train_df_sorted['clean_text']:
    # Split text into words
    words = text.split()
    # Increment count for each word
    for word in words:
        word_counts[word] += 1

# Sort word counts by count value in descending order
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

# Print word counts
total_words = 0
for word, count in sorted_word_counts:
    print(f'{word}: {count}')
    total_words +=1

print(f'''
      
We can see that there is a total number of {total_words} different words in this dataset''')


not: 117
you: 112
up: 88
like: 81
school: 79
hearing: 70
want: 70
nice: 67
year: 61
every: 60
all: 59
night: 59
friend: 59
good: 58
morning: 57
eat: 57
week: 56
me: 55
know: 55
spring: 55
student: 54
understand: 53
love: 52
help: 51
how: 51
hair: 50
orange: 50
teacher: 50
france: 48
what: 48
white: 48
room: 47
fish: 47
deaf: 46
milk: 46
no: 46
where: 46
yes: 46
again: 44
day: 44
cousin: 44
afternoon: 43
drink: 43
mother: 42
nothing: 42
sister: 42
yellow: 42
finish: 41
last: 41
learn: 41
water: 41
write: 41
blue: 40
fine: 40
meet: 40
sad: 40
table: 40
tired: 40
black: 39
english: 39
forget: 39
green: 39
pencil: 39
please: 39
red: 39
go: 38
happy: 38
i: 38
name: 38
play: 38
brother: 37
brown: 37
doctor: 37
father: 37
lost: 37
apple: 36
book: 36
computer: 36
family: 36
grandmother: 36
hurt: 36
sick: 36
sign: 36
sit: 36
dance: 35
draw: 35
girl: 35
man: 35
next: 35
nurse: 35
see: 35
paper: 35
sandwich: 35
sunday: 35
when: 35
bus: 34
grandfather: 34
here: 34
pink: 34
purple: 34
take: 34
than

The counts show that the dataset is imbalanced, whichever words we start with. The word with the most video links in the MS-ASL train dataset is `not` with 117, while the least represented word, `fire`, has only a single video link.

Since we initially chose 10 words from the HandSpeak list, we will now obtain the number of video links listed in the dataframe for each of them.

In [25]:
# Initialize a dictionary to store counts
word_counts = {}

# Iterate over the list of words
for word in chosen_words:
    # Filter the dataframe based on the word in the 'clean_text' column
    train3_df = filtered_train_df_sorted[filtered_train_df_sorted['clean_text'].str.contains(word, case=False)]
    
    # Count the number of rows in the filtered dataframe and store the count
    word_counts[word] = len(train3_df)

# Print the counts
for word, count in word_counts.items():
    print(f"Occurrences of '{word}': {count}")



Occurrences of 'eat': 108
Occurrences of 'hello': 30
Occurrences of 'name': 38
Occurrences of 'no': 435
Occurrences of 'please': 39
Occurrences of 'sorry': 25
Occurrences of 'thank you': 34
Occurrences of 'want': 70
Occurrences of 'what': 64
Occurrences of 'where': 46


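At first glance *'no'* has the most video links at 435, followed by *'eat'* at 108, but these counts are inflated: `str.contains` matches substrings, so *'no'* also counts glosses such as *'not'*, *'know'*, and *'nothing'*, and *'eat'* exceeds the 57 whole-word occurrences found in the earlier tally. *'sorry'* has the fewest links at 25.

We will keep this initial list simple, with just 3 words to start preprocessing and modeling. The words *please*, *sorry*, and *hello* are shortlisted because they can be used right from the beginning and they have a similar number of videos.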
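
The counts above come from `str.contains`, which matches substrings rather than whole glosses. The sketch below illustrates the difference between substring matching and exact matching on a toy column; the rows are made up for illustration and are not taken from MS-ASL.

```python
import pandas as pd

# Toy glosses for illustration only (not MS-ASL data).
toy = pd.DataFrame({'clean_text': ['no', 'not', 'know', 'nothing', 'eat', 'great']})

hits = {}
for word in ['no', 'eat']:
    # Substring match: 'no' is also found inside 'not', 'know', 'nothing'.
    substring_hits = int(toy['clean_text'].str.contains(word, case=False, regex=False).sum())
    # Exact match: only rows whose gloss is exactly the word.
    exact_hits = int((toy['clean_text'] == word).sum())
    hits[word] = (substring_hits, exact_hits)
    print(f'{word}: substring={substring_hits}, exact={exact_hits}')
```

Exact matching is what the later `isin(first_three)` filter uses, so exact counts are the better guide to how many links will actually be available for each word.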


So the following will be the new list of words:

In [26]:
first_three = ['please', 'sorry', 'hello']

print(first_three)

['please', 'sorry', 'hello']


In [27]:
train_df = filtered_train_df_sorted[filtered_train_df_sorted['clean_text'].isin(first_three)]

print(train_df.head())
print(train_df.shape) # 94 video links for the 3 sign actions

      clean_text                                            file  label  height  width    fps  start   end                                  cleaned_url
14003      hello  American Sign Language Learning the basics.mp4      0     360    640  29.97   1874  1903  https://www.youtube.com/watch?v=yG09SQB0Hds
653        hello                        Unit 1 Vocabulary(2).mp4      0     360    640  29.97    280   362  https://www.youtube.com/watch?v=7YYB3BEoksc
652        hello                        Unit 1 Vocabulary(2).mp4      0     360    640  29.97    250   277  https://www.youtube.com/watch?v=7YYB3BEoksc
3434       hello                        ASL Vocabulary hello.mp4      0     360    640  29.97    110   150  https://www.youtube.com/watch?v=9jUuFHB2m4M
3433       hello                        ASL Vocabulary hello.mp4      0     360    640  29.97     30    80  https://www.youtube.com/watch?v=9jUuFHB2m4M
(94, 9)


In [28]:
# Export as csv file 

train_df.to_csv('../data/msasl_train.csv', index = False)

---
###  1.4.3 Testing and Validation Datasets


Microsoft ASL separates its data into train, test, and validation json files. We will apply exactly the same process used for the training dataset to the testing and validation files so that they can be used for modeling.
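
Since the same steps are repeated for each split, they could be bundled into one function. This is a minimal sketch rather than the code used in this notebook: `prepare_split` and `KEEP_COLUMNS` are hypothetical names, and the toy rows stand in for one of the MS-ASL json files.

```python
import pandas as pd

KEEP_COLUMNS = ['clean_text', 'file', 'label', 'height', 'width',
                'fps', 'start', 'end', 'cleaned_url']


def prepare_split(raw_df: pd.DataFrame, words: list) -> pd.DataFrame:
    """Hypothetical helper bundling the cleaning steps applied to the train split."""
    out = raw_df.copy()
    # add the 'https://' scheme where it is missing
    out['cleaned_url'] = [u if u.startswith('https://') else 'https://' + u
                          for u in out['url']]
    # strip whitespace from the file name and add the '.mp4' suffix
    out['file'] = [f.strip() + '.mp4' for f in out['file']]
    # keep the needed columns, sort by gloss, then keep only the chosen words
    out = out[KEEP_COLUMNS].sort_values(by='clean_text')
    return out[out['clean_text'].isin(words)]


# Toy rows with made-up values, for illustration only.
toy = pd.DataFrame({
    'clean_text': ['sorry', 'beer', 'hello'],
    'file': ['sorry 1 ', 'beer', 'hello'],
    'label': [1, 2, 0], 'height': [360] * 3, 'width': [640] * 3,
    'fps': [29.97] * 3, 'start': [0] * 3, 'end': [30] * 3,
    'url': ['www.youtube.com/watch?v=aaa',
            'https://www.youtube.com/watch?v=bbb',
            'www.youtube.com/watch?v=ccc'],
})

prepared = prepare_split(toy, ['please', 'sorry', 'hello'])
print(prepared[['clean_text', 'file', 'cleaned_url']])
```

With such a helper, each split would need only a `pd.read_json` call followed by one function call, which keeps the three splits from drifting apart if a cleaning step changes.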

---
### 1.4.3.1 Microsoft-ASL *Test* dataset
---

In [29]:
# import testing dataframe from MSASL
test_df = pd.read_json('../data/MS-ASL/MSASL_test.json')


#### Start Exact cleaning steps as Training Data ####

# some urls are missing the 'https://' scheme, so we add it where it is absent
# and store the result in a new 'cleaned_url' column
test_df['cleaned_url'] = [url if url.startswith('https://') else 'https://' + url
                          for url in test_df['url']]

# as the file names inside the dataframe do not have '.mp4' and have trailing whitespaces,
# we will clean the values in the 'file' column 

test_df = test_df.copy()
test_df['file'] = [file.strip() + '.mp4' for file in test_df['file']]    

#### COMPLETED ####   

print(test_df.head()) # inspect first 5 rows of the test_df before subsetting

    org_text clean_text  start_time  signer_id  signer  start  end                      file  label  height     fps  end_time                                  url       text                                                                                 box  width  review                                  cleaned_url
0       beer       beer         0.0         20      40      0   59   SignSchool Beer var.mp4    805     360  29.970     1.969  www.youtube.com/watch?v=wX78EPtSuzU       beer                     [0.046787232160568, 0.290409207344055, 1.0, 0.8233144283294671]    640     NaN  https://www.youtube.com/watch?v=wX78EPtSuzU
1      enjoy      enjoy         0.0         11      39      0   54              enjoy(1).mp4    192     360  29.970     1.802  www.youtube.com/watch?v=OL02Odh2dRg      enjoy                     [0.071450918912887, 0.136024981737136, 1.0, 0.9055149555206291]    480     NaN  https://www.youtube.com/watch?v=OL02Odh2dRg
2  emotional  emotional         0.0         20

In [30]:
# create a cleaner dataframe with the headers that will be needed

filtered_test_df = test_df[['clean_text',  'file', 'label', 'height', 'width', 'fps', 'start', 'end',  'cleaned_url']]

# Sort the DataFrame based on the 'clean_text' column
filtered_test_df_sorted = filtered_test_df.sort_values(by='clean_text')
print(filtered_test_df_sorted.head())

     clean_text                          file  label  height  width    fps  start   end                                  cleaned_url
2354       #asl                       ASL.mp4    309     720   1280  30.00      0    49  https://www.youtube.com/watch?v=DFcoHE0WC5c
530         #ok  ASL2 Class 45 Food SIGNS.mp4    714     720   1280  29.97   3395  3445  https://www.youtube.com/watch?v=qATuIbBufYQ
3616        #ok  ASL3 Class 21 Vocabulary.mp4    714     720   1280  29.97    957  1011  https://www.youtube.com/watch?v=5LuS-TTg7rA
4089        100            SignSchool 100.mp4    849     360    640  29.97      0    28  https://www.youtube.com/watch?v=I87rZdO7SPY
2915        100          SignSchool 100 2.mp4    849     360    640  29.97      0    27  https://www.youtube.com/watch?v=vBssGGCcpfE


In [31]:
test_df = filtered_test_df_sorted[filtered_test_df_sorted['clean_text'].isin(first_three)]

print(test_df.head()) # inspect dataframe
print(test_df.shape) # 33 video links for the 3 words

     clean_text                               file  label  height  width    fps  start  end                                  cleaned_url
1946      hello        ASL1 Class 48 New SIGNS.mp4      0     720   1280  29.97     84  150  https://www.youtube.com/watch?v=kVIt-qr5P2s
3393      hello  ASL Top ASL Signs for Servers.mp4      0     360    360  29.97     11   51  https://www.youtube.com/watch?v=FBimL8-ND3E
3238      hello       ASL3 Class 23 Vocabulary.mp4      0     720   1280  29.97     92  124  https://www.youtube.com/watch?v=eLpHf9bfdsI
1275      hello        ASL1 Class 29 New SIGNS.mp4      0     720   1280  29.97    106  136  https://www.youtube.com/watch?v=H3eHAKpTF70
3394      hello  ASL Top ASL Signs for Servers.mp4      0     360    360  29.97    101  161  https://www.youtube.com/watch?v=FBimL8-ND3E
(33, 9)


In [32]:
# Export as csv file 

test_df.to_csv('../data/msasl_test.csv', index = False)

---
### 1.4.3.2 Microsoft-ASL *Validation* Dataset
---

We will inspect the validation dataframe and, if its columns are organized identically to those of the MS-ASL train dataset, clean the `url` values and add '.mp4' as a suffix to the `file` name. These are exactly the steps used for the MS-ASL train dataset. Lastly, we will subset the dataframe for better readability and export it.

In [33]:
# import validation dataframe from MSASL
val_df = pd.read_json('../data/MS-ASL/MSASL_val.json') 

#### Start Exact cleaning steps as Training Data ####

# some urls are missing the 'https://' scheme, so we add it where it is absent
# and store the result in a new 'cleaned_url' column
val_df['cleaned_url'] = [url if url.startswith('https://') else 'https://' + url
                         for url in val_df['url']]

# as the file names inside the dataframe do not have '.mp4' and have trailing whitespaces,
# we will clean the values in the 'file' column 

val_df = val_df.copy()
val_df['file'] = [file.strip() + '.mp4' for file in val_df['file']]    

#### COMPLETED ####   

print(val_df.head()) # inspect first 5 rows of the df

   org_text clean_text  start_time  signer_id  signer  start  end              file  label  height     fps  end_time                                          url      text                                                                                 box  width  review                                  cleaned_url
0    absent     absent         0.0        114      -1      0   37    ASL ABSENT.mp4    837     360  28.971     1.277  https://www.youtube.com/watch?v=ri3NrdgfAtE    absent                   [0.21896389126777602, 0.008568197488784, 0.9978311061859131, 1.0]    202     NaN  https://www.youtube.com/watch?v=ri3NrdgfAtE
1      help       help         0.0         76      42      0  110        help 2.mp4     50     360  29.970     3.670          www.youtube.com/watch?v=l31UXgChCS4      help                   [0.050372719764709, 0.29941257834434504, 1.0, 0.6968145966529841]    640     NaN  https://www.youtube.com/watch?v=l31UXgChCS4
2   come on    come on         0.0        114      -

In [34]:
# create a cleaner dataframe with the headers that will be needed

filtered_val_df = val_df[['clean_text',  'file', 'label', 'height', 'width', 'fps', 'start', 'end',  'cleaned_url']]

# Sort the DataFrame based on the 'clean_text' column
filtered_val_df_sorted = filtered_val_df.sort_values(by='clean_text')
print(filtered_val_df_sorted.head())

     clean_text                                            file  label  height  width     fps  start   end                                  cleaned_url
3361       #asl                                      ASL(9).mp4    309     360    640  25.079      0    99  https://www.youtube.com/watch?v=72Iq9Cc0rcU
5135        #ok                                       OK(2).mp4    714     360    640  30.000      0   123  https://www.youtube.com/watch?v=eXNd79kbUIg
3358        100  Numbers 1 to 100 in American Sign Language.mp4    849     360    640  29.970   6108  6155  https://www.youtube.com/watch?v=6r8HDsBMk1E
1476        100  Counting 1-100 in ASL with Base Ten Blocks.mp4    849     360    640  29.970   5973  6021  https://www.youtube.com/watch?v=iJwNP8oNphY
3261        100                              ASL I - Unit 8.mp4    849     720   1280  24.000   8111  8197  https://www.youtube.com/watch?v=VcRx3U-HpTE


In [35]:
val_df = filtered_val_df_sorted[filtered_val_df_sorted['clean_text'].isin(first_three)]

print(val_df.head()) # inspect dataframe
print(val_df.shape) # 43 video files for the 3 words

     clean_text                                              file  label  height  width    fps  start    end                                  cleaned_url
2048      hello                                        Unit 1.mp4      0     360    640  29.97  12647  12830  https://www.youtube.com/watch?v=BEKUHzwnVO8
848       hello         Mr Miseners ASL lesson 1 vocab review.mp4      0     720   1280  30.00   2149   2263  https://www.youtube.com/watch?v=kZuJalsv_z0
1478      hello                               Social Work Voc.mp4      0     720   1280  30.00      6     39  https://www.youtube.com/watch?v=V8DTIPK4QlU
3590      hello                     ASL I - Unit 1 Vocabulary.mp4      0     720   1280  24.00   2240   2316  https://www.youtube.com/watch?v=2VB3WN8adyM
867       hello  Greetings  Answers in American Sign Language.mp4      0     360    640  29.97     14     73  https://www.youtube.com/watch?v=pD4QabDQr6M
(43, 9)


In [36]:
# Export as csv file 

val_df.to_csv('../data/msasl_val.csv', index = False)

## 1.5 Video Downloads


The links provided point to YouTube, and we can download the videos with the pytube library. We can create a function to automate the downloads for each word in the respective datasets. There is a request cap of about 700 videos, which we will not reach.


In [67]:
# create a function to help us automate the video downloading process

def download_videos(some_dataframe: pd.DataFrame, word_label: str, save_folder: str) -> str:
    '''
    Downloads videos from a DataFrame based on a specific label and saves them to a directory.

    Args:
    some_dataframe: A pandas DataFrame containing the signed word in a 'clean_text' column
                    and video URLs in a 'cleaned_url' column.
    word_label: The label of the video category to download.
    save_folder: Folder for the split, e.g. 'train_data', 'test_data', or 'val_data'.

    Returns:
    A string indicating the number of videos downloaded successfully.
    '''
    # Create a filtered dataframe from the input dataframe to only extract videos from the given word_label
    temp_df = some_dataframe[some_dataframe['clean_text'] == word_label]

    # Save path of where the videos will be downloaded
    SAVE_PATH = os.path.join('../videos', save_folder, word_label)
    os.makedirs(SAVE_PATH, exist_ok=True)  # Ensure the output directory exists

    # Create local variables
    downloaded_vids = 0
    total = len(temp_df)

    for count, url in enumerate(temp_df['cleaned_url'], start=1):
        # Use a try-except block so that one failed video does not stop the loop
        try:
            yt = YouTube(url)
            mp4_streams = yt.streams.filter(file_extension='mp4')

            # If no MP4 stream is found
            if not mp4_streams:
                raise ValueError('No MP4 streams found')

            # Choose the first item in the list
            d_video = mp4_streams[0]

            d_video.download(output_path=SAVE_PATH)
            logging.info('File Downloaded Successfully: %s', url)

            downloaded_vids += 1

        except Exception as e:
            logging.error('Error downloading video from %s: %s', url, e)

        print(f'{count}/{total}')  # progress counter

    total_downloaded = f'{downloaded_vids}/{total} downloaded'
    return total_downloaded
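
Several rows can point at the same video, because one video may contain more than one clip of a sign, so the loop requests that URL once per row. The sketch below shows one way to reduce a list of links to one URL per video. It assumes YouTube watch URLs that carry the video id in the `v` query parameter; `video_id` and `unique_video_urls` are hypothetical helper names.

```python
from urllib.parse import urlparse, parse_qs


def video_id(url: str) -> str:
    """Return the value of the `v` query parameter of a YouTube watch URL."""
    return parse_qs(urlparse(url).query)['v'][0]


def unique_video_urls(urls):
    """Keep the first URL seen for each video id, preserving the original order."""
    seen = {}
    for url in urls:
        seen.setdefault(video_id(url), url)
    return list(seen.values())


# URLs as they appear in the train dataframe; the first one is listed twice.
urls = [
    'https://www.youtube.com/watch?v=9jUuFHB2m4M',
    'https://www.youtube.com/watch?v=9jUuFHB2m4M',
    'https://www.youtube.com/watch?v=eXpXg4q-qEQ&t=170s',
    'https://www.youtube.com/watch?v=yG09SQB0Hds',
]

print(unique_video_urls(urls))
```

Keying on the video id rather than the full URL also treats links that differ only in a `&t=` timestamp as the same video.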


In [68]:
# Use a for-loop to download all the videos
for word in first_three:
    
    download_videos(train_df, word, 'train_data') # download videos from train dataset
    download_videos(test_df, word, 'test_data')   # download videos from test dataset
    download_videos(val_df, word, 'val_data')     # download videos from validation dataset

INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=6UrcyZ-QeiU


1/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=6UrcyZ-QeiU


2/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=TDoqoJDX280


3/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=rkQZQhloXuE


4/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=p36hZJQpLoQ: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>


5/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=SI_7UivPW_I: SI_7UivPW_I is unavailable


6/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=rkQZQhloXuE


7/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FKp2hBhdj9k


8/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=wNS3kifA8-Y: wNS3kifA8-Y is unavailable


9/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Fz1I_Ssr3AE


10/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=XtkDeYBnR8o


11/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=eXpXg4q-qEQ&t=170s: eXpXg4q-qEQ is a private video


12/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=-LB4ENHxcIs


13/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=nvaS2J5mcEk


14/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


15/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=eyu0V3s1-XA: eyu0V3s1-XA is a private video


16/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=UoITyyziLOw


17/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Doxjf4SCY4k


18/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=hjS0dQDgbjo


19/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=yG09SQB0Hds


20/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=_rviR_jhCmg


21/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=lSxAVvE9sPc: lSxAVvE9sPc is a private video


22/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Q6CfqYwp1ag


23/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Q6CfqYwp1ag


24/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=6XcxbPfP5YQ


25/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=XtkDeYBnR8o


26/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=SI_7UivPW_I: SI_7UivPW_I is unavailable


27/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=ygpjKBrb91s: ygpjKBrb91s is unavailable


28/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Eq6SnaimpzQ


29/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Fz1I_Ssr3AE


30/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


31/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=EKHye1RrCy0


32/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=p8OYydc3WQM: p8OYydc3WQM is age restricted, and can't be accessed without logging in.


33/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Q6CfqYwp1ag


34/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=TDoqoJDX280


35/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=MDQQ9WZNwcc


36/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


37/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=ygpjKBrb91s: ygpjKBrb91s is unavailable


38/39


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


39/39


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=bXyG4iebEBo


1/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FwGL9FjFpYI&t=1s


2/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=xbVVThaez2g


3/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=RMUydgQr9TE


4/8


ERROR - Error downloading video from https://www.youtube.com/watch?v=RHeFBNVYii4: RHeFBNVYii4 is a private video


5/8


ERROR - Error downloading video from https://www.youtube.com/watch?v=LTVv46_N764: LTVv46_N764 is unavailable


6/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


7/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


8/8


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


1/10


ERROR - Error downloading video from https://www.youtube.com/watch?v=S_LEeTt2McM: S_LEeTt2McM is unavailable


2/10


ERROR - Error downloading video from https://www.youtube.com/watch?v=9g-hioQapYE: 9g-hioQapYE is age restricted, and can't be accessed without logging in.


3/10


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=cxXEULq9Jpc


4/10


ERROR - Error downloading video from https://www.youtube.com/watch?v=gQKqDXE-LJ8: gQKqDXE-LJ8 is a private video


5/10


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


6/10


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=sQELZPmche8


7/10


ERROR - Error downloading video from https://www.youtube.com/watch?v=iDWntsY_rZM: iDWntsY_rZM is unavailable


8/10


ERROR - Error downloading video from https://www.youtube.com/watch?v=iDWntsY_rZM: iDWntsY_rZM is unavailable


9/10


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=2VB3WN8adyM


10/10


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=siQnKSRRg_c


1/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=lTLlauTBuVQ


2/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=KRrKqGEGdMg


3/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=0nTuxPL7RGg


4/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=rkQZQhloXuE


5/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=yhBt5YS_2L4&t=261s: yhBt5YS_2L4 is a private video


6/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=eXpXg4q-qEQ&t=170s: eXpXg4q-qEQ is a private video


7/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=zCUZOnKoKWQ: zCUZOnKoKWQ is unavailable


8/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=lSxAVvE9sPc: lSxAVvE9sPc is a private video


9/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=lSxAVvE9sPc: lSxAVvE9sPc is a private video


10/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=siQnKSRRg_c


11/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=MDQQ9WZNwcc


12/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=Cgh1DXAQBuI: Cgh1DXAQBuI is unavailable


13/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=p8OYydc3WQM: p8OYydc3WQM is age restricted, and can't be accessed without logging in.


14/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=UoITyyziLOw


15/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=siQnKSRRg_c


16/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=yJePAnyVcqg: yJePAnyVcqg is unavailable


17/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=yJePAnyVcqg: yJePAnyVcqg is unavailable


18/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FKp2hBhdj9k


19/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=shPBfkIYYpU


20/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=cIjUDNEYWTE


21/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=W1gUtNUjWdI: W1gUtNUjWdI is a private video


22/25


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=4Nh1iFv2BMc


23/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=SI_7UivPW_I: SI_7UivPW_I is unavailable


24/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=SI_7UivPW_I: SI_7UivPW_I is unavailable


25/25


ERROR - Error downloading video from https://www.youtube.com/watch?v=BL90-ejIQ9Y: BL90-ejIQ9Y is a private video


1/5


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=SJAhRxI7i9c


2/5


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


3/5


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


4/5


ERROR - Error downloading video from https://www.youtube.com/watch?v=-hzoIOWYJac: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>


5/5


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=sQELZPmche8


1/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=iDWntsY_rZM: iDWntsY_rZM is unavailable


2/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=DGWErPoCHqw


3/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


4/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=3A0iSc2umlg: 3A0iSc2umlg is unavailable


5/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=9g-hioQapYE: 9g-hioQapYE is age restricted, and can't be accessed without logging in.


6/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=cxXEULq9Jpc


7/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=3A0iSc2umlg: 3A0iSc2umlg is unavailable


8/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=g59_3vr_NT0


9/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


10/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=2VB3WN8adyM


11/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=BJadr-8uX5U: BJadr-8uX5U is age restricted, and can't be accessed without logging in.


12/19


ERROR - Error downloading video from https://www.youtube.com/watch?v=BJadr-8uX5U: BJadr-8uX5U is age restricted, and can't be accessed without logging in.


13/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=M1CrNP2izNg


14/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=_LGqHR3FhL8


15/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=cxXEULq9Jpc


16/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Xq3QAqQXpTM


17/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=_LGqHR3FhL8


18/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=cxXEULq9Jpc


19/19


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=yG09SQB0Hds


1/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


2/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


3/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=9jUuFHB2m4M: 9jUuFHB2m4M is unavailable


4/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=9jUuFHB2m4M: 9jUuFHB2m4M is unavailable


5/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=gU5EI_ZTzxw


6/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=yG09SQB0Hds


7/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=JHFExZgYdwk&t=4s


8/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=75BkNdtDsoQ


9/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=WbkSmhKTltU: WbkSmhKTltU is unavailable


10/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=p36hZJQpLoQ


11/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=wRoGQu75wzw: wRoGQu75wzw is unavailable


12/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=wRoGQu75wzw: wRoGQu75wzw is unavailable


13/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=WxY7E9P46DM


14/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=3zoqSvF0Z2A


15/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=dbzKXsyAcvY: dbzKXsyAcvY is unavailable


16/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=Cgh1DXAQBuI: Cgh1DXAQBuI is unavailable


17/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FVjpLa8GqeM


18/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=QB44Vddoi-w


19/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=sEjaTTIdHhk


20/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=25ymRY7hbjs


21/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=Eq6SnaimpzQ


22/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=97WDRBCtj0o


23/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=7YYB3BEoksc: 7YYB3BEoksc is a private video


24/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=p8OYydc3WQM: p8OYydc3WQM is age restricted, and can't be accessed without logging in.


25/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=uPKOwziViaw


26/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=7iuyJ84wvds


27/30


ERROR - Error downloading video from https://www.youtube.com/watch?v=D7UYn37qTkU: D7UYn37qTkU is unavailable


28/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FbBdHdmt5oI


29/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=RRzbB2rg4VM


30/30


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=kVIt-qr5P2s


1/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


2/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=eLpHf9bfdsI


3/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=H3eHAKpTF70


4/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FBimL8-ND3E


5/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=2qLS_mXdrh8


6/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=4bVux9Ni2_0


7/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=bhSbk9eQPu4


8/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=FwGL9FjFpYI&t=1s


9/20


ERROR - Error downloading video from https://www.youtube.com/watch?v=3qKcPONtNXQ: 3qKcPONtNXQ is unavailable


10/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HBQaI1ZG-LU


11/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=exk-qIRNw7Y


12/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=2TQMVwOkHWw


13/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=-DMFu-nAsok


14/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=K8T0-zeB3jc


15/20


ERROR - Error downloading video from https://www.youtube.com/watch?v=6kvCOzxP9_A: 6kvCOzxP9_A is a private video


16/20


ERROR - Error downloading video from https://www.youtube.com/watch?v=oRsVMK7sxMQ: oRsVMK7sxMQ is a private video


17/20


ERROR - Error downloading video from https://www.youtube.com/watch?v=RMUydgQr9TE: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>


18/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=xQR5y_xUtpA


19/20


ERROR - Error downloading video from https://www.youtube.com/watch?v=ZgUBU0NNEDU: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>


20/20


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=BEKUHzwnVO8


1/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=kZuJalsv_z0


2/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=V8DTIPK4QlU: V8DTIPK4QlU is age restricted, and can't be accessed without logging in.


3/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=2VB3WN8adyM


4/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=pD4QabDQr6M: pD4QabDQr6M is unavailable


5/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=xmM5-XsE8Ac


6/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=0v4Y9WGhES8


7/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


8/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=HUMEcnkvhJU&t=862s


9/14


INFO - File Downloaded Successfully: https://www.youtube.com/watch?v=0v4Y9WGhES8


10/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=iDWntsY_rZM: iDWntsY_rZM is unavailable


11/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=iDWntsY_rZM: iDWntsY_rZM is unavailable


12/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=S_LEeTt2McM: S_LEeTt2McM is unavailable


13/14


ERROR - Error downloading video from https://www.youtube.com/watch?v=9g-hioQapYE: 9g-hioQapYE is age restricted, and can't be accessed without logging in.


14/14


---
## 1.6 Word-Level American Sign Language (WLASL) Dataset
---



Because the MS-ASL (Microsoft-ASL) dataset contains relatively few video files, we will combine it with the WLASL dataset from [Kaggle](https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed), which has 12,000 processed videos of different signs.

In [69]:
# read in wlasl dataset
wlasl = pd.read_json('../data/WLASL/WLASL_v0.3.json')


print(wlasl.head())

      gloss

Because the `wlasl` dataframe is deeply nested, each instance needs to be extracted and its fields assigned to individual columns.

In [70]:
# create a DataFrame for row details from all instances

# initialize an empty list to collect one record per video instance
video_details = []


for index, row in wlasl.iterrows():

    # flatten each nested instance into one record, renaming the default labels
    for instance in row['instances']:
        video_details.append({
            'clean_text': row['gloss'],         # the word or sign action
            'url': instance['url'],             # URL of the video
            'bbox': instance['bbox'],           # coordinates of the bounding box
            'fps': instance['fps'],             # the frames per second
            'start': instance['frame_start'],   # frame start
            'end': instance['frame_end'],       # frame end
            'signer_id': instance['signer_id'], # ID of the signer
            'file': instance['video_id'] + '_'  # video file name, with a trailing underscore added to simplify file name extraction
        })

# convert to pandas dataframe
wlasl_df = pd.DataFrame(video_details)

# inspect dataframe
print(wlasl_df.head())

  clean_text                                                                                             url                 bbox  fps  start  end  signer_id    file
0       book                                                    http://aslbricks.org/New/ASL-Videos/book.mp4  [385, 37, 885, 720]   25      1   -1        118  69241_
1       book  https://aslsignbank.haskins.yale.edu/dictionary/protected_media/glossvideo/ASL/BO/BOOK-418.mp4  [190, 25, 489, 370]   25      1   -1         90  65225_
2       book                                                     https://www.youtube.com/watch?v=0UsjUE-TXns   [262, 1, 652, 480]   25      1   -1        110  68011_
3       book                                                     https://www.youtube.com/watch?v=1QOYOZ3g-aY  [123, 19, 516, 358]   25      1   60        113  68208_
4       book                                                     https://www.youtube.com/watch?v=aGtIHKEdCds   [95, 0, 1180, 720]   25      1   -1        109  68012_
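
The same flattening can also be written with `pd.json_normalize`. The sketch below is only illustrative: `sample_records` is a small made-up sample shaped like the WLASL JSON (not the real file), and it omits the trailing underscore that this notebook appends to `file`.

```python
import pandas as pd

# hypothetical sample shaped like WLASL_v0.3.json (all values are made up)
sample_records = [
    {
        "gloss": "book",
        "instances": [
            {"video_id": "00001", "fps": 25, "frame_start": 1, "frame_end": -1},
            {"video_id": "00002", "fps": 25, "frame_start": 1, "frame_end": 60},
        ],
    },
    {
        "gloss": "hello",
        "instances": [
            {"video_id": "00003", "fps": 25, "frame_start": 1, "frame_end": 36},
        ],
    },
]

# one row per instance, with the parent gloss repeated on every row
flat_df = pd.json_normalize(sample_records, record_path="instances", meta=["gloss"])

# rename to the column names used in this notebook
flat_df = flat_df.rename(columns={
    "gloss": "clean_text",
    "frame_start": "start",
    "frame_end": "end",
    "video_id": "file",
})

print(flat_df)
```

With the real data, the list of records would come from loading the JSON file with `json.load` instead of `pd.read_json`; that step is an assumption and is not part of the original notebook.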


In [71]:
# create a cleaner dataframe with the headers that will be needed

filtered_wlasl_df = wlasl_df[['clean_text', 'file', 'fps', 'start', 'end']]

# Sort the DataFrame based on the 'clean_text' column
filtered_wlasl_df_sorted = filtered_wlasl_df.sort_values(by='clean_text')

print(filtered_wlasl_df_sorted.head())

      clean_text    file  fps  start  end
15742          a  01615_   25      1   -1
15735          a  00295_   25      1   23
15736          a  66039_   25      1   -1
15737          a  01610_   25      1   -1
15738          a  01611_   25      1   -1


In this dataframe:
- `clean_text`: the signed word, renamed to follow the MS-ASL naming
- `file`: the name of the video file
- `fps`: the number of frames per second in the video clip
- `start`: the starting frame of the sign action
- `end`: the ending frame of the sign action; a value of `-1` means the sign spans the entire length of the clip
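
As a minimal sketch of how the `start`, `end`, and `fps` columns could be interpreted, the helpers below resolve the `-1` sentinel and compute the length of a sign in seconds. The function names and the `total_frames` argument are hypothetical, and the code assumes frame numbers are 1-based and inclusive, which the source does not confirm.

```python
def resolve_frame_range(start, end, total_frames):
    """Return the (first, last) frame numbers of the sign, both inclusive.

    Assumes 1-based frame numbers and that end == -1 means
    'until the last frame of the clip'.
    """
    last = total_frames if end == -1 else end
    return start, last


def sign_duration_seconds(start, end, total_frames, fps):
    """Length of the signed segment in seconds."""
    first, last = resolve_frame_range(start, end, total_frames)
    return (last - first + 1) / fps


# a 75-frame clip whose sign covers the whole clip
print(resolve_frame_range(1, -1, 75))

# a sign spanning frames 1297 to 1343 of a longer video at 25 fps
print(sign_duration_seconds(1297, 1343, 5000, 25))
```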

In [72]:
# to visually see how extensive the dataset is

# Initialize defaultdict for word counts
word_counts = defaultdict(int)

# Iterate over each row in the DataFrame
for text in wlasl_df['clean_text']:
    # Split text into words
    words = text.split()
    # Increment count for each word
    for word in words:
        word_counts[word] += 1

# Sort word counts by count value in descending order
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

# Print word counts
total_words = 0
for word, count in sorted_word_counts:
    print(f'{word}: {count}')
    total_words +=1

print(f'''
      
We can see that there is a total number of {total_words} different words in this dataset''')


up: 58
book: 40
last: 39
hearing: 37
drink: 35
year: 35
take: 33
computer: 30
you: 30
all: 29
language: 29
hot: 28
now: 28
give: 28
school: 28
want: 28
dog: 27
before: 26
chair: 26
go: 26
clothes: 25
who: 25
paper: 25
every: 25
candy: 24
room: 24
week: 24
sign: 24
cousin: 23
deaf: 23
day: 23
here: 23
south: 23
fine: 22
help: 22
no: 22
thin: 22
walk: 22
yes: 22
new: 22
president: 22
hard: 22
pay: 22
black: 21
cool: 21
finish: 21
like: 21
many: 21
mother: 21
orange: 21
table: 21
thanksgiving: 21
what: 21
woman: 21
heart: 21
not: 21
off: 21
high: 21
love: 21
bed: 20
blue: 20
bowling: 20
can: 20
family: 20
fish: 20
graduate: 20
hat: 20
kiss: 20
later: 20
man: 20
shirt: 20
study: 20
tall: 20
white: 20
wrong: 20
bear: 20
from: 20
get: 20
toilet: 20
america: 20
face: 20
in: 20
accident: 19
apple: 19
bird: 19
change: 19
color: 19
corn: 19
cow: 19
dance: 19
dark: 19
doctor: 19
eat: 19
enjoy: 19
forget: 19
meet: 19
pink: 19
pizza: 19
play: 19
secretary: 19
short: 19
time: 19
work: 19
for: 19
tue

Looking at the WLASL dataset, `up` has the most videos (58), while `caterpillar` has the fewest (6).
Although 1983 different signed words are available, they come as a single compiled list, whereas the MS-ASL dataset is already split into train, test, and validation sets. In light of this, there is no need to conduct a train-test split later, as we will process the videos according to how the dataframe is structured. All the WLASL videos will be added to the training data and moved to the MS-ASL train folder. From the next notebook onward, there will be no further references to the WLASL or MS-ASL datasets.
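
The per-sign video counts can also be obtained directly with `value_counts`. This is a sketch on a made-up miniature dataframe (`sample_df` is hypothetical), not a rerun of the counts reported above.

```python
import pandas as pd

# hypothetical miniature of wlasl_df, one row per video
sample_df = pd.DataFrame({
    "clean_text": ["up", "up", "up", "book", "book", "caterpillar"],
})

# number of videos per sign, most frequent first
videos_per_sign = sample_df["clean_text"].value_counts()

print(videos_per_sign)
print("most videos:", videos_per_sign.idxmax(), videos_per_sign.max())
print("fewest videos:", videos_per_sign.idxmin(), videos_per_sign.min())
print("distinct signs:", videos_per_sign.size)
```

Unlike the loop above, which splits each label on whitespace, `value_counts` treats every label as one unit, so multi-word labels would be counted as a single sign.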

In [73]:
# filter the WL-ASL to only contain the words we need

wlasl_df = filtered_wlasl_df_sorted[filtered_wlasl_df_sorted['clean_text'].isin(first_three)]

print(wlasl_df.head()) # inspect dataframe
print(wlasl_df.shape) # 42 video files that are relevant

     clean_text    file  fps  start   end
6505      hello  27177_   25      1    -1
6504      hello  27175_   25      1    -1
6493      hello  27171_   25   1297  1343
6494      hello  70017_   25   4842  4923
6495      hello  68236_   25      1    36
(42, 5)



Now we will copy the matching video files into one folder per word, ready to be added to the training data.

In [76]:
# loop over the filtered rows and copy each video into a folder named after its word

for word in first_three:
    for index, row in wlasl_df.iterrows():
        if word == row['clean_text']:
            dst = os.path.join('../videos/wlasl_data', word)                     # destination folder
            os.makedirs(dst, exist_ok=True)                                      # creates the folder if needed
            src = os.path.join('../videos/wlasl_videos', row['file'] + '.mp4')   # source file location

            try:
                shutil.copy(src, dst)                                            # creates a copy of the file
            except Exception as e:
                # report the failure instead of skipping it silently
                print(f'Could not copy {src}: {e}')
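
Since some listed videos may not be present locally, it can help to record which files were copied and which were missing. The sketch below shows one possible way to do that; `copy_videos` is a hypothetical helper, and the demonstration runs in a temporary directory with made-up file names rather than on the real video folders.

```python
import shutil
import tempfile
from pathlib import Path


def copy_videos(rows, src_dir, dst_root):
    """Copy '<file>.mp4' into dst_root/<word>/ and return (copied, missing)."""
    copied, missing = [], []
    for word, file_stem in rows:
        src = Path(src_dir) / f"{file_stem}.mp4"
        if not src.is_file():
            missing.append(src.name)      # keep track of files that are absent
            continue
        dst_dir = Path(dst_root) / word
        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst_dir)
        copied.append(src.name)
    return copied, missing


# demonstration in a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "wlasl_videos"
    src_dir.mkdir()
    (src_dir / "11111_.mp4").write_bytes(b"")            # stand-in for a video
    rows = [("hello", "11111_"), ("hello", "22222_")]    # second file is absent
    copied, missing = copy_videos(rows, src_dir, Path(tmp) / "wlasl_data")
    copied_exists = (Path(tmp) / "wlasl_data" / "hello" / "11111_.mp4").is_file()

print("copied:", copied)
print("missing:", missing)
```

Returning the two lists makes it easy to compare the number of copied files against the 42 rows in the filtered dataframe.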


In [77]:
# Export out as csv file for easier import
wlasl_df.to_csv('../data/wlasl.csv', index=False)