<a href="https://colab.research.google.com/github/sqhang/ClarifAI/blob/main/non_GPT_QA_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audio to text: Whisper

In [None]:
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install -y ffmpeg
# pydub is used to convert .m4a to .mp3
!pip install pydub
!pip install openai
!pip install transformers
!pip install ipython
# jiwer provides the wer() function used for the Word Error Rate evaluation
!pip install jiwer

import json
from pydub import AudioSegment

In [None]:
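import io
import wave

import openai
import tensorflow_datasets as tfds
from jiwer import wer
import numpy as np
import requests

# Load the TED-LIUM dataset
tedlium_dataset = tfds.load('tedlium')
test_data = tedlium_dataset['test']

# Initialize the OpenAI API (replace the placeholder with a real key)
openai.api_key = "your_openai_api_key"

# OpenAI speech-to-text endpoint that serves the hosted Whisper model
whisper_url = "https://api.openai.com/v1/audio/transcriptions"

# TED-LIUM audio is 16 kHz mono
SAMPLE_RATE = 16000

# Wrap raw PCM samples in a WAV container so the API receives a valid audio file
def pcm_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(np.asarray(samples).astype(np.int16).tobytes())
    return buffer.getvalue()

# Function to transcribe audio using the hosted Whisper model
def transcribe_audio(wav_bytes):
    response = requests.post(
        whisper_url,
        headers={"Authorization": f"Bearer {openai.api_key}"},
        files={"file": ("sample.wav", wav_bytes, "audio/wav")},
        data={"model": "whisper-1"},
    )
    response.raise_for_status()
    return response.json()["text"]

# Function to compute Word Error Rate
def compute_wer(true_transcript, predicted_transcript):
    return wer(true_transcript, predicted_transcript)

# Set up variables for WER calculation
num_samples = 100  # You can adjust this number as needed
wer_values = []

# Iterate through the test dataset and compute WER
# The TFDS 'tedlium' dataset exposes the audio as 'speech' and the reference as 'text'
for i, sample in enumerate(test_data.take(num_samples)):
    audio = pcm_to_wav_bytes(sample['speech'].numpy())
    true_transcript = sample['text'].numpy().decode('utf-8')

    try:
        predicted_transcript = transcribe_audio(audio)
        sample_wer = compute_wer(true_transcript, predicted_transcript)
        wer_values.append(sample_wer)
        print(f"Sample {i + 1}: WER = {sample_wer:.2f}")
    except Exception as e:
        print(f"Error in sample {i + 1}: {e}")

# Calculate the average WER over the samples that were transcribed successfully
if wer_values:
    average_wer = np.mean(wer_values)
    print(f"Average WER for {len(wer_values)} samples: {average_wer:.2f}")
else:
    print("No samples were transcribed successfully; average WER is undefined.")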
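
The evaluation above relies on `jiwer` for the Word Error Rate. As a rough illustration of what that metric measures, here is a minimal standard-library sketch: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The helper names `normalize` and `word_error_rate` are hypothetical and not part of the notebook, and the normalization step (lower-casing and stripping punctuation) is an assumption added here because Whisper emits punctuation and capitalization that reference transcripts may lack; `jiwer` may produce different numbers depending on the transforms applied.

```python
import re


def normalize(text):
    """Lower-case, drop punctuation, and split into words (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "The cat sat on mat."))
```

In this example one of the six reference words is missing from the hypothesis, so the sketch reports one error over six words.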


In [None]:
!wget https://us.openslr.org/resources/51/TEDLIUM_release-3.tgz

In [None]:
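# Assumes the archive was copied to Google Drive and Drive is mounted; if extracting straight from the wget download, use /content/TEDLIUM_release-3.tgz instead
!tar -xzf /content/drive/MyDrive/TEDLIUM_release-3.tgz -C /content/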

In [None]:
!whisper '/content/TEDLIUM_release-3/data/sph/911Mothers_2010W.sph' --model medium.en

In [None]:
!ffmpeg -i /content/TEDLIUM_release-3/data/sph/911Mothers_2010W.sph -vn -acodec libmp3lame -y /content/911Mothers_2010W.mp3

In [None]:
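import json
import os

from pydub import AudioSegment

def convert_m4a_to_mp3(audio_file_name):
    """Convert an .m4a file to .mp3 next to the original and return the new path."""
    audio = AudioSegment.from_file(audio_file_name, format='m4a')
    # splitext handles any extension length, unlike slicing off the last four characters
    output_file_name = os.path.splitext(audio_file_name)[0] + '.mp3'
    audio.export(output_file_name, format='mp3')
    return output_file_name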

In [None]:
FILE_NAME_m4a = '646 desktop record.m4a'
FILE_NAME_mp3 = '646 desktop record.mp3'

In [None]:
convert_m4a_to_mp3(FILE_NAME_m4a)

In [None]:
!whisper '646 desktop record.mp3' --model medium.en

In [None]:
with open("646 desktop record.json", 'r') as f:
    data = json.load(f)

print(data['text'])
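
Besides the full `text` field, the JSON file written by the `whisper` command also contains a `segments` list with per-segment `start` and `end` times in seconds. The sketch below shows one way to print a timestamped transcript from that structure. The helper name `format_segments` and the `example` dictionary are hypothetical; in the notebook you would pass the `data` object loaded above instead.

```python
def format_segments(result):
    """Return one '[start -> end] text' line per Whisper segment."""
    lines = []
    for segment in result.get("segments", []):
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.2f} -> {end:.2f}] {segment['text'].strip()}")
    return lines


# Hypothetical stand-in for the loaded JSON; only the fields used here are included
example = {
    "text": " Hello there. General remarks.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.25, "text": " General remarks."},
    ],
}

for line in format_segments(example):
    print(line)
```

Timestamps make it easier to jump back to the matching point in the recording when a passage of the transcript looks wrong.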

### transcript 1: phil recording

 Coherent set of attitudes. I haven't contradicted myself. But rationality is not a matter of not contradicting yourself. There are lots of ways to not contradict yourself that are plainly irrational. And so the mere fact that this is a coherent set of attitudes is not enough to show that we're being rational. Here's another way you could do it. Think about a T2 star, which I just need to be a variant on T2. So Sam comes in, Sam says that it's snowing, so I become highly confident that Sam said that it's snowing. Suppose I want to, I still think, suppose I'm a weather channel junkie and I'm very confident that it doesn't snow in Houston. Another way I could maintain my coherence is to give up on my confidence in Sam's veracity. I could think, well, Sam said it's snowing, but it's definitely not snowing. So I guess Sam is not as reliable as I thought he was. That's another way to maintain coherence. Now, Hume's idea, though, is that this is a bit different. We shouldn't do either of these things. Hume's idea is that if I'm highly confident in this and then if I started out highly confident in this, I'm sorry, if I started out no attitude here, highly confident that this is true, highly confident this is false, and then I become highly confident this is true, Hume thinks that there's a kind of, as we'll see on each of the quote in a bit here, there's a kind of mutual destruction between my confidence in the truth of two and my confidence in the falsehood of three. And so what should happen is I should become less confident in two. Instead of being highly confident, I should be, I don't know, moderately confident. And I should become much more confident in three. So instead of being very confident that it's false, I should be having a moderate degree of confidence. And so Hume's idea here is that once I become highly confident that one is true, I should become medium confident in both two and three. OK. Any questions about the proposal before we leave this yet? 
I'm just a little bit confused, like, logic-wise, how we have, like, half-competences on something. Yeah. Good. So if you've taken a logic class, then you'll recognize that this is a pattern that you're not going to be able to see. And you're not going to be able to see the pattern that you're going to be able to see. And then you'll recognize that this is a pattern. So this is the basic puzzle, is that we have three sentences, three propositions, and they can't all be true. This is an inconsistent trial. And so the truth of any two of the sentences guarantees the falsity of the other. And the situation we're imagining is where you're highly confident that one is true, no attitude towards two and three, and then you become highly confident that one is true as well. And now that requires you to adopt a high confidence in three. OK, now if you're used to thinking in terms of logic, in terms of the way you study these things in a logic class, you don't think about these sentences in terms of truth or in terms of how confident you can be that they're true. You think about them in terms of whether they are true or false. Because in logic, logic is what we call an alethic enterprise. It is where we trace out the relationship between the truth of one sentence and the truth of another sentence. What humans do in here is not logic. What humans do in here is epistemology. And epistemology is different from logic in that we're not talking directly about what is true and what is false. We're talking about what attitudes it is appropriate to adopt or rational to adopt. What should I believe? And how confident should I be in that belief? And that does not track logic cleanly. Strangely enough, logic and epistemology are quite a distinct enterprise. There are many logical truths, many relations between the two values of sentences, which are such that it would be irrational for you to believe it. 
And there are many things that are rational for you to believe, which nonetheless fail to track logic and these sort of things. So the question we're asking ourselves is, what should I believe? And for Hume, as we'll see, the question's really, what should I believe given my evidence? And that's very different from the question of, what does logic guarantee to be true? So there's another bit of a suppressed point in here. So I'm talking about, what should I believe? And then I go back and I start talking about levels of confidence. What's going on there? Well, Hume is adopting a specific view about what a belief is. We often talk in this kind of inconsistent ways about belief. Sometimes we talk about, I believe this and I don't believe that, where belief is a kind of categorical attitude. Either I believe it or I don't. But sometimes we talk about belief as a kind of commitment towards the truth of a proposition, where that commitment comes in degrees. So you can be more or less confident in some proposition. I am extremely confident that Boston is the capital of Texas. I am pretty confident, confident that Montpelier is the capital of Vermont. I am not at all confident about, is Tulsa the capital of Oklahoma? I don't even know. Oklahoma, like, never been talked about in Oklahoma. I have no idea what the capital of Oklahoma is, right? So I have these different attitudes towards these different propositions about state capitals. So the idea here is that this is at least some indication that belief is not an all or nothing matter. It's not categorical in the sense that for a sentence, either I believe it or I don't. But instead, it's a kind of commitment which comes in degree. And if you make it to that second class of epistemology, you'll start talking about degrees as measured by real numbers between 0 and 1 with a lot of probabilities. And what sort of constraints you can impose on a formal representation of beliefs in terms of probabilities? 
I'm just talking very coarsely about mass. I'm going to talk about high confidence or low confidence. OK. So Hugh's idea, again, is that when you find yourself in this situation and then you obtain the testimony, if you don't change your attitude towards at least one of these two, then you wind up with incoherent beliefs. And yet, there are different ways to resolve that kind of inconsistency and coherence. And Hugh's point is that you should reduce your confidence in this one and increase your confidence in this one. And that is the rational set of attitudes to adopt once you obtain that testimony. OK. Now, let's be a little more fine grained about it, though. Because we know here in Houston, it's very unlikely that it's snowing right now. What if we change it? What if we go a step further here? What if instead of Sam saying that it's snowing, what if Sam said that 2 plus 3 equals 7? OK, so I start out. I am highly confident that Sam is a reliable testifier. I have a very low confidence that 2 plus 2 equals 7. I'm very confident that 2 plus 2 does not, in fact, equal 7. And so my expectation that Sam is going to come into the room, Sam, the reliable testifier, I'm very confident he is not going to walk into the room and tell me that 2 plus 3 equals 7. Then Sam walks into the room and says, 2 plus 2 equals 7. What should I now believe? Now, again, given Hugh's way of thinking about this, I can now be extremely confident that Sam did, in fact, walk into the room and say that 2 plus 2 equals 7. But if my confidence here started out very low, and my confidence here is high, and then I become highly confident that Sam is offering this testimony, the idea here is that if my confidence that 2 plus 2 equals 7 started out being extremely low, I'm just very confident that this is false, then I'm not going to wind up medium confident that this is true, and medium confident that this is true, and highly confident that this is true. 
I will wind up instead being something like this. So I will wind up being highly confident that Sam offered the testimony. I will have a low confidence in this. I'm sorry, a low confidence in this, and a low confidence in this. So the idea here is that I'm going to have to change my attitude towards Sam's reliability, because Sam said something that I regard as being obviously false. Yeah? Why would we never change the attitude towards 1? Isn't it possible to be hallucinating Sam? Yeah, yeah, good. I think probably you should say that. You can sort of generalize the human account and think, well, what should I believe? So I believe my perceptual experiences are more or less reliable. Slot that in for 2. And I also believe that there are no chihuahua-sized pink elephants in existence. And so I think it's very unlikely there's going to be a chihuahua-sized pink elephant walking through the room right now. But what happens if I have a visual experience of a chihuahua-sized pink elephant walking through the room right now? I could either, I mean, something's got to give. Maybe I lose confidence, if I become much less confident that my perceptual experiences are reliable, and much more confident that there is a pink chihuahua, at least one, in existence. Or maybe I just think, well, I couldn't have just had that experience of the pink elephants and chihuahuas. But Thiem doesn't actually say that for real. Yeah? This structure, does it only apply to very particular cases of testimony? So like, if in the future Sam comes in after the 2 plus 2 equals 7 incident and goes, it's snowing, is my confidence there effective? Yes, Thiem is aggravatingly imprecise on this question of exactly what slots in for this proposition. So he often talks about human veracity in general. And so if that's right, then when Sam testifies the 2 plus 2 equals 7, I lose faith in humanity. 
And so then when Jonathan comes in and says it's snowing, I think to myself, well, I lost some faith in humanity. And so I guess I should trust Jonathan's lies. Right? Again, Thiem is maddeningly imprecise on this question. So it's hard to know what to say. But if a person wanted to spell out a rigorous version of a human encounter testimony, it's precise, this sort of question you're trying to answer. OK, we're short on time. So let me charge forward and talk about how this goes. OK, so Thiem's big idea is this. He says, look, when you want to know what to believe after you find yourself in this sort of situation where you obtained testimony that you were not expecting to obtain, and now you have these inconsistent beliefs and you have to figure out the second side of that inconsistency and regain your coherence, he thinks that there are at least two different kinds of things to consider. The first thing to consider are facts about the testifiers specific to this basic question, or about the testimony in general, mostly about the testifier. But also things like, do other testifiers agree? Are other people saying that 2 plus 2 equals 7? Does the testifier have an interest in the outcome? Are they trying to sell you a used car? Because if they're trying to sell you a used car and they tell you it's a nice car and it's never going to break down, you should be skeptical just because they have an interest in you believing that it's a nice car. Is the person drunk? Are they an expert? These sorts of features about the person doing the testifying gives you some information about how you can reconcile the great consistency. But also, importantly, there are these questions about the testimony and about the type of proposition which is being testified about. That will give you some information about whether you should believe what the person has told you. If it's the sort of question about which there is lots of disagreement, then you should be less confident in it. 
So for example, if someone comes along and says, here's a piece of testimony. The best food on Earth is tacos. Now, tacos are delicious, we can all agree. But there's lots of disagreement about what's the best kind of food. Some people like tacos. Some people like other foods. I don't know, ketchup? And so maybe that's just the sort of matter about which testimony is generally unreliable. Maybe like preferences, things about aesthetic preferences are types of testimony which are unreliable. Maybe you think that claims about the distant past are unreliable. So Caesar famously crossed the Rubicon. Stepping out of the Rubicon, he either stepped out with his left foot or with his right foot. Imagine someone comes along, Sam goes along and says, guess what, Caesar stepped out with his left foot first. How confident should you be in what Sam just told you? You might think to yourself, well, that happened a long time ago. And as far as we know, nobody wrote down how Caesar exited the Rubicon. But of course, then, I think about it. And so how could you possibly know that? And so I shouldn't trust your testimony when you say it. OK, so big picture point. If you have a good reason to believe that this is false and then someone testifies that it's true, Hume's claim is that you shouldn't become highly confident that it's true on the basis of that testimony. At best, you should increase your confidence in this somewhat and decrease your confidence in this somewhat. OK, so I have to read this quote to you because it's one of the great Hume books of all time. So let me get a, OK. Now, what's the case in which you're maximally confident that three is false? Now, what would it mean to be maximally confident that three is false? Well, to be maximally confident that three is false, you have to have very strong evidence that it's false, or so says Hume. What does it mean to have very strong evidence that three is false? 
Well, it's to have uniform experience of the falsity of claims like three. What does it mean for something to be something that you have universally, without exception, experience to be the case? If you have a kind of regularity, which is such that you have, like, every time you experience the fire, you experience smoke every time, and then someone comes along and says, there's a fire, but no smoke, that's an exception to what you have observed to be an exceptionalist regularity. And so according to Hume, this is a claim which is as strongly inconsistent with your evidence as could be without logical contradiction. And so this is a case where you should be maximally confident that the person has spoken falsely. Put that differently, because this person is testifying to an exception to what you observe to be an exceptionalist regularity. This person has testified that a miracle has occurred in the way that Hume defines miracle. And so, like all cases of miracles, you should be maximally confident that the miracle has not actually occurred. Because, again, this continues with uniform experience. And so you should always be highly confident that the miracle has not occurred. Now, given that you should be maximally confident that the miracle has not occurred, if you have a miracle that's being testified to, then once you go and try to reconcile the inconsistency between your beliefs, you should end up with a very low confidence that the miracle has occurred, even if an otherwise reliable testifier has testified to its truth. So here's Hume explaining exactly what he meant. He says, a wise person proportions their belief to the evidence. In such conclusions, as are founded on infallible experience, they expect the event with the last degree of assurance and regard past experience as a full proof of the future existence of that event. 
In other words, your uniform experience of smoke being ruined by fire is what he's calling here a full proof that, in the next instance, fire will be improved by smoke. He goes on. He says, the plague consequence, and it is a general maxim worthy of our attention, that no testimony is sufficient to establish a miracle unless the testimony be of such a kind that its falsehood would be even more miraculous than the fact which it endeavors to establish. And even in that case, there is a mutual destruction of arguments, and the superior only gives us an assurance suitable to that degree of force, which remains after deducting the inferior. When anyone tells me that he saw the dead man restored to life, I immediately consider with myself whether it be more probable that this person should either deceive or be deceived, or the fact that which he relates should really have happened. I weigh the one miracle against the other, and according to the superiority which I discover, I pronounce my decision and always reject the greater miracle. If the falsehood of this testimony is more miraculous than the event which he relates, then, and not till then, can he pretend to command my belief or opinion. OK, so here's the point. Here's what he was going for. Suppose you have a person testifying that some miracle has occurred. Because it's a miracle, you should be maximally confident that it has not occurred, because all your experience indicates otherwise. Now, if they really do give that testimony, and you thought before that this was a reliable testifier, then you should be at least slightly more confident that the miracle has occurred, and probably much less confident that this person is a reliable testifier. And that's how you resolve your inconsistency of your beliefs. But think about the strongest possible case in which you get the best evidence that a miracle has occurred. That would be one in which you're maximally confident in the reliability of the testifier. 
What would it mean to be maximally confident in the reliability of the testifier? Well, it would mean that you've had the opportunity to obtain lots of testimony from this person, and then the opportunity to independently verify what this person says. And it has been your exception-less experience that this person speaks truly. Now, if in your experience, without exception, this person always speaks truly, then it would be, by definition, it would be a miracle for this person to speak falsely. I don't mean it to be a thing. I just mean it to be miraculous, as in an exception to an exception's regularity. And so in the case I'm imagining, a person has come along and offered some testimony. The thing they testified to is a miracle, and so you think it's almost certainly not true. But this person is a paragon of veracity, and so it would be a miracle for that person to speak falsely. And so now what you have here is a mutual destruction of arguments, a mutual destruction of miracles. At least one of the miracles has occurred. Either the paragon of veracity has spoken falsely, or there's smoke without fire, or a person rising from the dead, or whatever the miracle is. And on Hume's view, in this particular case, when you have miracle against miracle, they destroy each other in the sense that you might become slightly more confident in one over the other, but you probably shouldn't be very confident in either one. So when is it rational to believe that a miracle has occurred? Well, the only way it could be rational to believe that a miracle has occurred is if there's far more experience, far more uniform experience, for the veracity of the speaker than there is experience contrary to the existence of the miracle. So maybe I've seen five campfires in my life, and they all are under fire and smoke. It would be a miracle for me that there exists a fire without smoke. 
But suppose I attend campfire university full of campfire experts who go out and observe campfires for a living. And suppose they all get together one day, and they're just like, Brian, you've got to understand. There's this one campfire in France, and it burns without smoke, right? OK. Well, I mean, I'm sort of presupposing in this story that I have verified the veracity of each of these experts. But in that case, there are these two miracles. It's miraculous if these people are speaking falsely, and it would also be miraculous for them to be a campfire that's not accompanied by smoke, but to destroy each other. But because I'm imagining the evidence in favor of the veracity of these testifiers is stronger than my evidence against the existence of the fire of the smoke, now it would be rational for me to believe that there's a fire within the smoke. And in the case of a miracle, a non-religious miracle, as Hume would say, a miracle from the sciences, so a scientific breakthrough, a sort of revision to what had been thought to be a scientific principle, that's a miracle, to give you the sense. In that case, the only time it's rational to believe it, is on the basis of testimony, anyway, is if a sufficient number of sufficiently reliable testifiers all get together and say that it's true. OK. Now, that part, up till now, is the first section of section 10. The second part is aimed squarely at religious miracles, and I'm mostly just not going to talk about that. It's less interesting if you pursue it to your interest. But I will just tell you that what happens in the second section is that Hume comes along and he says, OK, it's at least possible that with non-religious miracles, it's sometimes rational to believe that the miracle has occurred on the basis of testimony, again, because a sufficient number of experts all get together and they're reliable enough, and there's enough of them, and they all say the same thing. 
And so even if it's inconsistent with my experience, now maybe it's rational for me to believe that there exists this fire of no smoke. But, says Hume, with religious miracles, there are some special reasons to never believe that a religious miracle has occurred on the basis of testimony. And he gives us a number of reasons, some of which, I think, they arrange for kind of mean-spirited to have been dated, especially for Hume writing in the 17th century. Anyway, his basic idea is just that there is a, here's some of the reasons he gives. He says, there is a human tendency to marvel at the miraculous and just be prejudiced, be gullible to believe that spectacular things have occurred. And so we should be on the lookout against that tendency in ourselves. He thinks that the sorts of people, so he's a kind of mean-spirited thing from Hume, the sorts of people who testify to the existence of religious miracles tend to be those who are not very trustworthy. He thinks, here's an actually interesting one. It's kind of a big and more mean-spirited, I think. But he thinks that there are lots of different people testifying to the religious miracles of different religions, different inconsistent religions, in the sense that if the one religion is true, the other one cannot be false. And so the testimony for a miracle of one religion, which supports the veracity of that religion's claims, is opposed by testimony for the miracle of another religion, which supports the broader claims of that religion. But since the two religions are inconsistent, testimony in the favor of one is testimony against the other, and vice versa. And hence, there's this kind of mutual destruction of all religions because of the support for the other religions. OK, so you can read that for yourself. That's probably not the most interesting part of the session. OK, great. 
So what I want to do now, the reason I'm rushing through this, I apologize, is I want to talk about papers that you'll have to write at the end of the semester here. I want to just give you a sense of what I'm looking for and how to write a great paper. But as a kind of warm-up to that, we're going to take a pop quiz. Please read the instructions carefully before you ask any questions. OK.


### transcript 2: 421 recording

 We'll be honest, we've already done it, but I'm interested in hearing more from you. So, here are my comments when we start talking about this. On Saturday, the idea that we're going to have to update the directory to secure the entry by entry. After you update the data, it's forgetting to be block by block. But once you have a getting block, when you input the directory to block, it's going to assume that they're entering by entry. At the end, they draw the same length, originally. Now we're going to allocate the space, and within the block, I'm going to have to calculate the space, what I call jump by jump. Jump by jump to jumps are bounded in both ways. That sign was chosen so that no matter what the actual sacred size is, no matter what the actual block size is, no shown camera stands across the sacred boundary or across the block boundary. What that means is you can always read the directory jumps in a single disk operation on the right of the directory jumps in a single disk operation. That means you're going to have a lot less of a problem with the potential of, for example, writing a jump that might stand across the block or a sacred boundary. If it does stand across the boundary, you're going to be writing a block right here, so you're going to have consistent crashes after writing the first half of the jump before writing the second half of the jump. Now, these jumps are written entirely on one operator, and they're expanded across the block and across the sacred boundary. Within the jump, the reality of the space is in very precise pieces. So these familiar things are a mild name. Here they add the null termination operator, a great reason to get exposed for convenience, and they allocated the item in there. But there's two other fields we talked previously about. On Thursday, we have the length of the name. 
The length of the length of the null termination basically tells you how large this field is or how many characters in this field are important or relevant. But the record length field tells you the length of the entire table. The record length field is independent of the name length field because the record may be longer than each city that represents just that name. If the whole size of the entire record would fall in the size of the name plus the size of the item number plus the size of the name length field, there would be no need for a record length field because the record length would be applied to the size of the other pieces. But the record length field being independent allows the entire record to be larger than it used to be. And the reason for that is the 512-byte chunks are always exactly full. Suppose you have your name that don't exactly go up the 512-byte and the last record in the chunk, the length of that record is longer than it used to be. Basically, that last record is as small as up the remaining space within the chunk to fill out the rest of that chunk again. When you put a directory entry in the class divergent, no matter what, you can change the item number to zero and that directory entry is effectively considered to be going to be technically still there, but being bored when you read over it is still catching the price of the size of the directory, but it does not determine the actual directory entry. And most of the time here we don't do that. Instead, because the name length field is not going to be perfectly independent, and you delete a directory entry, you simply take the directory entry and report it in the same chunk. If the chunk consists of a bunch of directories, you delete the directory entry out of the middle of that chunk, and you're created for that to be the same chunk. It's larger. You can increase the size of the small of the space that the directory you just deleted used to consume. 
Now again, the chunk is now 100% full. There's no entity space in the chunk that the directory you are going to use to be. You can't do that, though, if you delete the first directory entry of the chunk, because just to make the space that was left by the first directory entry were consumed into the previous directory entry. Now we're just going to be crossing the chunk down. We want to do that. As I said, for the first directory entry chunk, they still change the item number to zero. Let me show you an example of that. Here's a simple chunk that has three directory entries in it. Finally, it takes you to test us in, and we're going to make one, three, four. So the size of all these fields, and I'm going to show the exact number of examples here, the name length and the record on fields are twice each. That means that you could have a directory entry with the name of that directory entry being basically up to 65,000 characters long, and don't actually use it for that. The actual limitation is 255 characters. But I guess they sign in. Some of the signed decisions on the document have the exact reason for some of these decisions. Important ones, they document the less important ones, they just hit it. So I assume that the format that shows them could, inside some days, make the names being too long, 255 characters. But really, 255-character file names is pretty long enough. You want to really type, you know, CD to primary character name. Probably not. But the representation of the names is up to two to the 16th. So the two and two bytes, right? Two bytes each. The item number used to be two bytes. In my case, it's two bytes. The class name uses two bytes. Here's the item number. These four bytes just allow for more individual files to be part of a really large file system. These files have a unique item number. There's a name, download terminated, and the name length field. It might include that in the byte. So if you look at what we have up here... All right. 
We have abc... Oh, that's four bytes. Three is the length of the name, not including the null, just the abc part. Seventeen is just the inode number. That's just an arbitrary number in my example. And what would the record length be? Let me start over. Four bytes for the inode number, two bytes for each of these two length fields is now a total of eight, and four bytes for the name with its null is a total of twelve. And that's what the record length field is, twelve. All right, so basically, this is what it looks like before we start thinking about what happens when you delete something. The entry for 1234 covers all the way to the end of the block, so its size is 485 instead of just the 13 bytes it would otherwise be. If I delete test.c, what happens is the record for abc, which was 12 bytes long, now becomes 12 plus the 15 bytes of test.c, so it becomes a total of 27. The chunk is still full. If test.c is still here but I delete abc instead, then I have to change the 17 to zero, because there is no previous entry. So basically the idea is that nothing spans across a chunk boundary, block boundary, or sector boundary. Within a chunk, it's a little more complicated than the fixed-size format when you try to find space. For example, when you create a new file, you have to read through the directory to find out whether the file name already exists. In the fixed-size format, as you read through the directory, you can remember an empty fixed-size slot that you've seen along the way, in case the name doesn't exist.
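The deletion rule from the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the file system's actual code: the chunk is modeled as a list of entries rather than raw bytes, `DirEntry`, `build_chunk`, and `delete` are hypothetical names, and the inode numbers 23 and 42 are made up (only 17 for abc comes from the lecture).

```python
from dataclasses import dataclass

CHUNK_SIZE = 512
HEADER_SIZE = 4 + 2 + 2  # inode number + name length + record length


@dataclass
class DirEntry:
    inode: int    # 0 marks an unused entry
    name: str
    rec_len: int  # bytes from the start of this entry to the start of the next


def min_size(name: str) -> int:
    """Smallest record that can hold this name: header + name + null byte."""
    return HEADER_SIZE + len(name) + 1


def build_chunk(items):
    """Pack (inode, name) pairs into one chunk; the last record absorbs the slack."""
    entries = [DirEntry(ino, name, min_size(name)) for ino, name in items]
    used = sum(e.rec_len for e in entries[:-1])
    entries[-1].rec_len = CHUNK_SIZE - used
    return entries


def delete(entries, name) -> bool:
    """Delete a name from one chunk, keeping the chunk exactly full."""
    for i, e in enumerate(entries):
        if e.inode != 0 and e.name == name:
            if i == 0:
                # First entry in the chunk: there is no previous entry to
                # grow, so just mark this one unused.
                e.inode = 0
            else:
                # Fold the deleted entry's space into the previous entry.
                entries[i - 1].rec_len += e.rec_len
                del entries[i]
            return True
    return False


chunk = build_chunk([(17, "abc"), (23, "test.c"), (42, "1234")])
print([e.rec_len for e in chunk])  # [12, 15, 485]
delete(chunk, "test.c")
print([e.rec_len for e in chunk])  # [27, 485]
```

In both cases the record lengths still add up to 512, so the chunk stays exactly full and no entry spans a chunk boundary.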
Then you know where to put the new name you are creating. Here, as you read through the directory, you can see whether the name already exists, and you can also see whether there's unused space, where the record length field is larger than it needs to be. You find the first space in the directory that is large enough to contain the new entry that you might be creating. You have to read all the way to the end of the directory to find out whether that name already exists in the directory. But if it doesn't exist, then having read through the directory once, you already know where to put the new entry. You don't have to go explicitly hunting for space to put the new entry. You sort of get that as a byproduct of having to decide whether the new name exists in the directory already or not.
Okay, next topic is better locality. Basically, we want to arrange things so that, without too much extra work, we can keep things near other things: the data blocks of a single file near other data blocks of the same file, and the data blocks of a file near the inode of the same file. We want to hopefully have fewer seeks. Now, before I start on this, if you remember, we already talked about how you can't really do disk scheduling anymore, because we don't know where anything is anymore. We don't know physically how far apart anything is. In the classical view of how operating systems talk to disks, you knew the entire geometry of the disk. You knew the exact number of surfaces and the number of sectors per track. You knew exactly everything. So you could do the math and figure out exactly which cylinder everything was in, and you knew how far apart everything was. We're going to try to get better locality. Locality in terms of what? We don't know where anything is. How do you do locality when you don't know where anything is? And the reality is that it still works nearly the same way. When this was originally designed, cylinders were still cylinders.
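Returning to the directory scan described above: the point that the free slot comes as a byproduct of the name lookup can be sketched as a single pass over the entries. This is a minimal sketch under assumptions, not the real implementation: entries are plain `(inode, name, rec_len)` tuples, the 8-byte header matches the lecture's example, and `scan_for_create` is a hypothetical name.

```python
HEADER_SIZE = 8  # 4-byte inode number + 2-byte name length + 2-byte record length


def needed(name: str) -> int:
    """Bytes a record needs to hold this name (header + name + null byte)."""
    return HEADER_SIZE + len(name) + 1


def scan_for_create(entries, new_name):
    """One pass over (inode, name, rec_len) tuples.

    Returns (exists, first_fit): whether new_name is already present, and the
    index of the first entry with enough spare room for it (or None).
    """
    want = needed(new_name)
    first_fit = None
    for i, (inode, name, rec_len) in enumerate(entries):
        if inode != 0 and name == new_name:
            return True, None
        # An unused entry (inode 0) offers its whole record; a live entry
        # offers only what its record length holds beyond its own needs.
        in_use = needed(name) if inode != 0 else 0
        if first_fit is None and rec_len - in_use >= want:
            first_fit = i
    return False, first_fit


entries = [(17, "abc", 27), (42, "1234", 485)]
print(scan_for_create(entries, "xy"))       # (False, 0)
print(scan_for_create(entries, "new.txt"))  # (False, 1)
print(scan_for_create(entries, "abc"))      # (True, None)
```

The scan still has to reach the end of the directory before it can say the name is absent, but by then the first usable slot is already known.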
We still knew where everything was. So-called logical block addressing had not yet been introduced into disk hardware or into the operating system interface. So when this was originally designed, you knew exactly where everything was, and locality really meant locality. As we said when we talked about disk scheduling and how the disk interface works in today's world, logical block numbers that are numerically near each other are still generally physically near each other. So if you put two things in blocks whose block numbers are numerically near each other, then you know you have almost always put them physically near each other. You don't know about some subtle things like automatic bad block forwarding, but that should be rare. Hopefully in most cases today, you have no automatic bad block forwarding, because you have no bad blocks. But you may have bad blocks. When this was first designed, the manufacturing of disks was not as perfect as it is today. You need a very uniform layer of magnetic oxide coating, and you can't just deposit that and get a totally perfect surface. The technology for that has gotten better. So newly manufactured disks used to have some bad blocks sometimes. Today, your hard disk probably has zero bad blocks on it, unless you're unlucky. So we're still going to be able to make use of a notion of locality that's relevant enough to still get things near each other and make the performance of using the files much better than it would be if we weren't trying to do this. So, just like the variable-length file names we talked about, this was also introduced in the so-called Fast File System from the University of California, Berkeley. This change is what the file system was named after. The Fast File System has variable-length file names, which may make it minusculely slower, but on balance the file system is still quite a bit faster because of this change.
So what we're going to do is divide the total amount of disk space into a sequence of cylinder groups. Originally, cylinder groups were literally a consecutive range of cylinders. Say, everything between cylinder number 5 and cylinder number 10: those are all near each other, and at most they are a short seek apart from each other. Cylinder number is what matters in terms of distance. Rotation we have no control over. So putting two things together in the same cylinder is the best you can do, but putting them in nearby cylinders is still pretty good. You don't want them in far-apart cylinders. So we're going to divide the surface of the disk into these so-called cylinder groups. Now it's really a range of imaginary cylinders, but it's still a range of block numbers that are numerically near each other and thus still tend to be physically near each other. And each cylinder group is essentially like a little miniature file system itself. Each cylinder group contains a redundant copy of the superblock. They also rearranged what they put in the superblock. Before, you had one superblock. With one superblock, it's easy to keep the one copy consistent with itself, because there's only one copy. But if you have multiple copies, when I write this copy, do I now have to write all the other copies? No, because now the superblock is read-only. The superblock contains some information that's always been read-only, like the total size of the file system and so forth. The superblock also used to contain, for example, the beginning of the free block list and the beginning of the free inode list. That's now been moved out of the superblock. So the superblock now has only read-only information, and there can be a redundant copy in every cylinder group. So now we get two advantages. When you need to go read the superblock, there's always a copy nearby, at most a short seek away.
And if you have a hard disk with only one copy like that, what happens if your superblock goes bad? Now we have backup copies in case of failure, as well as always having one nearby. So we've got a copy of the superblock, and we also move the inodes into the beginning of each cylinder group. A cylinder group is a subset of the total disk space. You have a range of consecutive cylinder numbers, which means I have a subset of the total storage capacity in each cylinder group: the blocks that are in that set of cylinders. I've split up the inodes. Instead of having all the inodes together at the very beginning of the disk space, I have a little pool of inodes in each cylinder group. I have a separate list of free blocks in the cylinder group. I have a separate list of free inodes in the cylinder group. All the information within the cylinder group constitutes everything I need to use that cylinder group as a little miniature file system. It's not exactly used that way, but we take advantage of the ability to do things within the cylinder group, because everything is self-contained within the cylinder group. So now we have the disk divided up. We have to decide what to put where. Should I put everything in one cylinder group, and when that fills up, go to the next cylinder group, and when that fills up, go to the next one? Should I spread things sort of randomly across different cylinder groups? Random is roughly what we had before the file system changed, before we had cylinder groups. Not literally so, but you can almost think of it as every block that gets added to any file being put at a random block number somewhere on the surface of the disk. We had really no control over which block we got. We just took the first block off the free list, and it might be anywhere. Now we have a free list in each cylinder group. So all I've got to decide is which cylinder group to put the next block into.
I've got to decide which cylinder group to take the next inode from. The inodes are still pre-allocated, but a subset of the inodes is at the beginning of each cylinder group. So when I create a new file, which cylinder group should I take the inode from to make that new file? When I enlarge a file, when I add another block to the file, should I put that block in the same cylinder group as other parts of the file? Should I put that block in a different cylinder group? It's a matter of policy to decide which cylinder group to use for what and when. The first rule is: for the files in a directory, try to keep the inodes for all the files that are in that directory in the same cylinder group as the inode of the directory itself. If you cd to a directory and you say ls, you see a bunch of names. We're going to try to keep the inode for each one of those names in the same cylinder group as the directory that contains those files. But obviously we can't do that forever. If we did that for all files, everything would end up in the same one cylinder group. If I've got a thousand files or a million files in one directory, I can't put them all in the same cylinder group, and I probably don't want to. If some of the files in that directory are regular files, maybe put them in the same cylinder group as the directory itself. But some of those files may themselves be directories. If A is a directory and A contains B, and B is a directory, should I put the files that are contained in B in the same cylinder group as B, and should I put B in the same cylinder group as A? Again, if I do that, everything ends up in the same cylinder group.
I have to have rules for locality, such as this rule, where I try to keep these things together, but I also have to have rules where I deliberately put other things elsewhere, because otherwise you end up with everything concentrated in one cylinder group. And so when I create a new directory, we're going to deliberately put that in a different cylinder group than the containing directory, the parent directory. For regular files, we try to take their inodes from the same cylinder group as the directory that contains those files. But for directory files, so that we don't crowd everything into one cylinder group, we use an inode out of a different cylinder group for a new directory. So the regular files that are in that new directory go in that cylinder group, and the regular files that are in the first directory go in that first cylinder group. When I'm placing data blocks, where should I put the data blocks? Here by file I mean a directory or a regular file. They're all stored the same way, as a bunch of data blocks hanging off the inode. I'm going to try to put the data blocks of a file, whatever type of file it is, in the same cylinder group as the file's inode. So if I read the inode, I have the block numbers of the pieces of that file, and they're right there, very nearby. You might access the control information in the inode as well as the contents, the data blocks, and they're all very near each other. But again, if I do that without any sort of exceptions, a large file is going to consume all the space in one cylinder group. So now I can't put parts of other files in that same cylinder group. Say I have ten files. If one file is really big, it takes up all the space in that cylinder group.
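The inode placement rules above (regular files stay with their directory, new directories go elsewhere) can be sketched as a small policy function. The lecture does not say how the different cylinder group for a new directory is chosen or what happens when a group has no free inodes, so both choices below are assumptions made only for illustration: a new directory goes to the other group with the most free inodes, and a full group falls back to the next group in order. `pick_inode_group` is a hypothetical name.

```python
def pick_inode_group(parent_group, is_directory, free_inodes):
    """Choose the cylinder group to take a new file's inode from.

    free_inodes[g] is the number of free inodes in cylinder group g.
    """
    n = len(free_inodes)
    if is_directory:
        # New directories are deliberately placed away from their parent.
        # Assumption: prefer the other group with the most free inodes.
        others = [g for g in range(n) if g != parent_group and free_inodes[g] > 0]
        if others:
            return max(others, key=lambda g: free_inodes[g])
    # Regular files stay with the directory that contains them.
    # Assumption: if that group is full, try the following groups in order.
    for step in range(n):
        g = (parent_group + step) % n
        if free_inodes[g] > 0:
            return g
    raise RuntimeError("no free inodes in any cylinder group")


free = [5, 9, 9, 0]
print(pick_inode_group(0, False, free))  # 0
print(pick_inode_group(0, True, free))   # 1
```

The two branches pull in opposite directions on purpose: one rule creates locality, the other keeps everything from piling into a single cylinder group.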
The other nine files I started in that cylinder group have now run out of space in that cylinder group. So I have to have rules where I'm trying to get locality, but I have to deliberately break that locality so I don't concentrate too much all in one place. So think about when you use all the direct block numbers in an inode. We talked about this scheme: in the classical version, we have ten direct block numbers. Those are the block numbers of actual data blocks, right there in the inode. But after that, when you get to the 11th block of data, that block number is in the first-level indirect block. In the inode, you have the single indirect block number, and in that block you have the block numbers of more data blocks. So when you go from the 10th block to the 11th block of the data of the file, you already have a little bit of a performance discontinuity. The first ten blocks, you can access them all immediately, easily, directly. For the 11th block, you have to read the inode to get the block number of the first-level indirect block, and then once you have read that block, you know the block number of the 11th data block. So there's a small performance difference there in getting to the 11th block. And so that's a good time to take the other kind of small performance hit, which is to begin in a new cylinder group. When we get to the 11th block of the file, we have consumed all the direct block numbers, all the data that fits into the blocks identified by the direct block numbers. When we go to the first block whose number has to be in the indirect block, we move into a new cylinder group. But again, we can't then stay in that second cylinder group as the file gets bigger and bigger and bigger, because we're going to run out of space in the second cylinder group. So think about the point where I have used up all the data blocks that fit into the direct block numbers in the inode.
That's a natural place to say, let's choose now to switch to a different cylinder group. From then on, you switch again after you consume a fixed amount of additional space. The original implementation switched after each one megabyte of additional space. Implementations now use larger amounts of space, because disk capacities and files are so much larger. But the point is, the shift from the first cylinder group to the second cylinder group naturally occurs when you consume all the space that fits into the direct block numbers, and the rest is simply a fixed interval. As the file gets bigger and bigger, after every fixed amount of data in the file, you start allocating those blocks in yet another cylinder group. So now as you read through the file, there's an occasional long seek, but most of the seeks are very, very short. If you read the file in random order, you go from the inode to the cylinder group that has the data in it. That might be one long seek, unless it's the first cylinder group used by the data in that file. So for any part of the file other than the beginning of the file, if you read it in random order, there's probably one long seek, and then the data is, again, all together. So now, basically, while paying almost no cost, we have the advantage of redundancy in the superblock information, and, with almost no extra work, we now have the ability to easily keep things near each other. Here you go.
So the last thing we talk about here is not pre-allocating inodes. The first two were hopefully reasonably straightforward to see. This one seems almost impossible, but when you think about it, it works perfectly.
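The data block placement rule can be written down as a short function: stay with the inode for the direct blocks, move to a new cylinder group at the first indirect block, and move again after every fixed interval. The ten direct blocks and the one-megabyte interval come from the lecture; the 4096-byte block size and the choice of simply advancing to the next group number each time are assumptions for illustration, and `group_for_block` is a hypothetical name.

```python
N_DIRECT = 10              # direct block numbers in the classical inode
INTERVAL_BYTES = 1 << 20   # original implementation: switch every 1 MB


def group_for_block(block_index, inode_group, n_groups, block_size=4096):
    """Cylinder group for the block_index-th data block (0-based) of a file."""
    if block_index < N_DIRECT:
        # Blocks reachable through direct block numbers stay with the inode.
        return inode_group
    # The first block that needs the indirect block starts a new group;
    # after that, every INTERVAL_BYTES of data moves to yet another group.
    blocks_per_interval = INTERVAL_BYTES // block_size
    hops = 1 + (block_index - N_DIRECT) // blocks_per_interval
    # Assumption: "another cylinder group" is modeled as the next group number.
    return (inode_group + hops) % n_groups


print(group_for_block(9, 3, 8))    # 3
print(group_for_block(10, 3, 8))   # 4
print(group_for_block(266, 3, 8))  # 5
```

With these numbers, blocks 10 through 265 share one group, so a sequential read pays one longer seek per megabyte and short seeks otherwise.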
We're not going to pre-allocate inodes, but we're still going to have the advantage that, given an inode number, I can do a small amount of math with that inode number and translate it directly to where the inode is stored on disk, even though I'm not pre-allocating inodes. This comes from Microsoft Windows NT. It's still what's used in all versions of Windows today. There's a saying that any problem in computer science can be solved by adding another level of indirection, or a little abstraction. That's what we're doing here. In the original design, we have the inodes all pre-allocated together. NTFS, the Windows file system, doesn't call them inodes. I'll tell you exactly what they call them, but they are the equivalent of inodes. In the original design, you pre-allocate all the inodes together. It's one contiguous chunk of disk space. Even the original Microsoft file system did the same thing: it pre-allocated all of the inodes, which for them are also directory entries, together in one place. And so they're all stored one after the other. So I can take the inode number and just do a little math on it. Inode number divided by the number of inodes that fit into each block is the block number, within the inode area, of the block in which that inode is stored. Inode number modulo the number of inodes that fit into a block is which inode within that block is the particular inode you want. It's really straightforward to get from the inode number to where the inode is. The change here is, instead of storing the inodes in a physical chunk of disk space, we're going to store the inodes in a file of inodes. Think about a file: if you want the 3,000th byte of a file, you know where that is in the file. If you have a block size of, say, 512 bytes, as it is in YFS, and you want the 3,000th byte of that file, 3,000 divided by 512 is which block of the data of that file contains the 3,000th byte of that file.
And 3,000 modulo 512, the block size, will tell you which byte within that block it is. So you can go from a byte number to the location of the block relative to the file, and within the contents of that block, to the byte you want. The same thing here. Given an inode number, since we're storing the inodes in a file full of inodes, the inode number times the size of an inode is the byte offset within the contents of that file. That divided by the block size is which block of the data of that file contains the inode we want. And that modulo the block size tells you at which byte within that block the inode begins. So just imagine a file and the contents of that file. Instead of the contents of the file being your source code, or the contents of that file being directory entry after directory entry after directory entry, here the contents of this file is inode after inode after inode. They're all one after another. The byte numbers relative to the contents of the file advance the same way as in a contiguous chunk of physical disk space. So we're just adding a level of indirection. It's basically the same thing as virtual memory versus physical memory. This is a very loose analogy: instead of storing the inodes in the very loose equivalent of physical memory, we're storing them in virtual memory. I don't mean that literally. We're not storing them in memory. But what I mean is, things you put in physical memory are physically contiguous. There is no other level of abstraction that can hide where they actually are. If you see it in physical memory, you have an address in physical memory, and things are where they appear to be. In virtual memory, things appear to be in certain places, but they actually could be anywhere. They could be discontiguous in physical space while they look contiguous in virtual space. I can go to the 3,000th byte of your virtual address space.
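The divide-and-modulo arithmetic described above is small enough to show directly. This is a sketch with hypothetical function names. The 512-byte block size and the 3,000 figure are the lecture's example, and the 64-byte inode size is assumed here just to have concrete numbers.

```python
def locate_in_inode_area(inode_number, inodes_per_block):
    """Pre-allocated layout: (block within the inode area, slot within that block)."""
    return divmod(inode_number, inodes_per_block)


def locate_byte(byte_offset, block_size):
    """Any file: (block of the file's data, byte within that block)."""
    return divmod(byte_offset, block_size)


def locate_inode_in_file(inode_number, inode_size, block_size):
    """Inodes stored in a file: inode number -> byte offset -> (block, byte)."""
    return locate_byte(inode_number * inode_size, block_size)


print(locate_byte(3000, 512))                # (5, 440)
print(locate_in_inode_area(19, 512 // 64))   # (2, 3)
print(locate_inode_in_file(19, 64, 512))     # (2, 192)
```

The two inode lookups agree: slot 3 of 64-byte inodes starts at byte 192 of the same block, so moving the inodes into a file does not change the math, only what the block number is relative to.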
By translating that virtual address to a physical address, that leads me to a particular physical page. Here I'm going to translate the inode number into a virtual location within the contents of the file, which is just a byte offset within the data of the file. From there, I can go through the list of block numbers that belong to the data of this file, and I can get to the actual physical space. Because it's now a file, the collection of inodes appears to be virtually contiguous. If you look inside this file, you see inode after inode after inode, and I can make the file bigger and bigger by writing more inodes into it. So I can allocate more inodes just by writing those inodes at the end of this file. Which physical block of disk space they end up in doesn't matter, because in the virtual view inside this file, it still looks like they're all contiguous. The interesting thing is how to make this work, because files are described by inodes, but now our inodes are stored within a file. So how do we find the inodes in the file when we can't find the contents of the file without the inode for the file? So now I have to talk about some details of how NTFS is implemented. Along the way, while I'm telling you how we don't have to pre-allocate the inodes anymore, this will also give other examples of how you can represent various things in a file system, variations on some of the things we have talked about. So first, terminology. What I call an inode, and I will often still use the term inode for it, Microsoft calls a master file table record. The collection of all inodes is the master file table. The master file table is a sequence of record after record. The records are all the same size, so record number 12 is immediately after record number 11 and so forth. A master file table record is the equivalent of an inode. It's also, by the way, the equivalent of a directory entry in the original Microsoft file system.
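The translation from a record number to a physical location, in the spirit of the virtual-to-physical analogy above, might look like the following. It is a simplified sketch: the file's block numbers are modeled as a flat Python list, which is an assumption made for illustration, and `locate_record` and the sample block numbers are hypothetical.

```python
def locate_record(record_number, record_size, file_blocks, block_size):
    """Map a fixed-size record stored in a file to (physical block, byte offset).

    file_blocks[i] is the physical block holding the i-th block of the file's
    contents; the blocks need not be contiguous on disk.
    """
    byte_offset = record_number * record_size          # virtual location
    logical_block, within = divmod(byte_offset, block_size)
    return file_blocks[logical_block], within           # physical location


# A file of records whose blocks are scattered across the disk.
blocks = [100, 101, 907, 55]
print(locate_record(12, 128, blocks, 512))  # (55, 0)
print(locate_record(11, 128, blocks, 512))  # (907, 384)
```

Records 11 and 12 are adjacent in the file's contents even though they land in unrelated physical blocks, and growing the file only means appending another block number to the list.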
In the original Microsoft file system, file names and inode information are all in one place, in one record together. That's unlike a Unix directory, where an entry is just a file name mapping to a number, and the inodes are separate. Microsoft has always had the file names and the inode information together. The master file table is really just a collection of the equivalent of all inodes. It's a collection of records, and the master file table records are all the same size. Each MFT record is exactly one kilobyte in size. You might say one kilobyte is pretty big for inode information. An inode can be 64 bytes. Remember, that depends on the size of a block address. Using four bytes would make it a bit bigger. We normally think of inodes as small. Here, they're several times larger than what we normally think of for inodes. It turns out that they make use of that space, so it doesn't go to waste. They get the advantage that the inodes are all a fixed size, the same thing we've always had, but now that the fixed size is larger, you get to put stuff into the inode that we normally wouldn't put into an inode. Now, the location of an inode. If I have the MFT record number, the record number times one kilobyte, because that's the size of each record, is the byte location within this file at which that MFT record begins. It's just like any other file, the contents of the file. More in a moment on how we know where this file is and how we know which blocks belong to the contents of this file. If you have the information about the blocks that belong to the contents of this file, we can take any byte offset within the contents of the file and, from the inode information, figure out which block of disk space stores that particular part of the contents of this file. Next is the boot parameter block in the Windows file system.
The boot block contains, as we talked about before, the executable code. The hardware knows how to read that first block of disk space into memory and start executing it, and that code knows how to read the kernel into memory: it finds the kernel within the file system, reads the kernel into memory, and starts executing the kernel. That's how the kernel gets started. They don't have a separate boot block versus superblock. The superblock kind of information that they have is in what they call the boot parameter block. The boot parameter block is actually a little sub-piece within the boot block. Most of the boot block is executable code, but there's still a table's worth of data, the superblock kind of information, and that is the boot parameter block. The boot block, just like elsewhere, is the first block of disk space. Microsoft really didn't have a choice there, because the PC, like pretty much all computer hardware, knows how to read the first block of disk space. The hardware of your computer doesn't know anything about the operating system that's on that disk. It just knows, I can read the first block of disk space into memory and start executing it. Microsoft sort of had to use block 0 to be the boot block, because that's what PC hardware is built to do. But again, this gets the same advantage we've already talked about. Block 0 is never going to be used as a real data block number, so you can use block number 0 to represent things that are sort of non-block numbers. One thing that's in the superblock, the boot parameter block, is the location of the beginning of the file full of MFT records: the block number, within the total disk space, where that file begins. You now know the block number of at least the first block's worth of MFT records. You have at least some of the MFT records in that block. If it's only one block, you don't have very many.
But they did more than one block: they made it the first 16 MFT records. The first 16 MFT records are allocated contiguously, and the block number of the beginning of that space is in the superblock. So if I look at the superblock, the boot parameter block, I have a block number where the first 16 MFT records are all laid out together, starting there and continuing into the disk space. So now I can get to the first 16 MFT records, basically the first 16 inodes. And notice I have not actually looked in any MFT record to get to these MFT records. I go to the boot parameter block to find where the MFT file begins, and there I can see the first 16 MFT records. So I do not have an endless recursion problem in finding those first 16 MFT records. I don't go through the MFT to find those MFT records. I go through the boot parameter block to find at least those first 16 MFT records. From there, it turns out I can find any other MFT record I need, and I can get to at least the first 16 without even using the MFT to find the MFT. The other records after the first 16 can be allocated anywhere in the disk space. They actually try to keep the entirety of the MFT file contiguous, but they don't have a way to guarantee that. Typically you can vary this when you format your file system, but the default is that the first one-eighth of the total disk space is reserved to be what they call the MFT zone. When you start allocating data blocks for other things, not pieces of the MFT, they get allocated outside of that first one-eighth of the disk space, leaving the first one-eighth of the disk space totally unused, other than that at the very beginning of it are those first 16 MFT records. After that is a bunch of unused space, before the space that starts getting filled up with other files' data blocks.
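The MFT zone policy can be sketched as a preference in the data block allocator: ordinary data blocks avoid the first one-eighth of the disk and only spill into it when nothing else is free, which is the fallback the lecture turns to next. This is a toy model under assumptions: free space is a Python set of block numbers, the allocator picks the lowest-numbered candidate (an arbitrary choice made here), and `allocate_data_block` is a hypothetical name.

```python
def allocate_data_block(free_blocks, total_blocks):
    """Allocate a block for ordinary file data, staying out of the MFT zone.

    free_blocks is a set of free block numbers and is updated in place.
    """
    if not free_blocks:
        raise RuntimeError("disk full")
    zone_end = total_blocks // 8  # default: first one-eighth is the MFT zone
    outside = [b for b in free_blocks if b >= zone_end]
    # Prefer space outside the zone; use the zone only as a last resort.
    block = min(outside) if outside else min(free_blocks)
    free_blocks.remove(block)
    return block


free = {3, 12, 50}
print(allocate_data_block(free, 80))  # 12
print(allocate_data_block(free, 80))  # 50
print(allocate_data_block(free, 80))  # 3
```

Only the third call dips into the zone, and only because the rest of the disk is already in use.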
So now we can allocate more and more MFT records as the MFT file grows. The MFT file generally grows contiguously, because you've been leaving unused those physical blocks following where the first 16 MFT records begin. But if you run out of disk space in the other seven-eighths of the disk, if you use up all the space over there, then the system has no choice but to start allocating other data blocks within the so-called MFT zone, and now maybe you can't allocate more MFT records contiguously with the beginning of the MFT. Even so, it still works. It's just more efficient if the MFT is totally physically contiguous, or is mostly contiguous with a small number of separate pieces.
Okay, so let's talk about some of the records that exist in the MFT. We're going into more detail than we really need in order to talk about how to not pre-allocate the inodes, but I want to paint the complete picture so you can see that the system really still works. In Unix, you have to know, for example, that inode number two is the root directory. Same kind of thing here. There has to be a fixed MFT record number, which is, again, the equivalent of an inode number, a fixed, well-known constant MFT record number that is the root directory. Otherwise, the file naming tree doesn't work anymore, because you have no way to find the root directory. You can't name the root directory. Any other directory you can get to by its file name. But you can't get to the root directory by its file name, because there is no name for the root directory. The slash character is a separator character. The slash characters are not stored on the disk. No directory on the disk has the name slash in it. The root directory is identified because you know the constant inode number. You know that number without looking it up in a directory. You don't get to translate the root directory's name into the inode number. You just know the number. Same thing here.
You have a fixed MFT record number which is the root directory. Before we get to that, though: everything I'm talking about here is in the first 16 MFT records. It's the part of the MFT that we know the location of because of the boot parameter block. Everything I'm talking about here on this slide you can find without looking inside any MFT record to find the particular MFT records. MFT record number zero, the first of those 16, is the MFT record that describes the MFT. That sounds very self-referential and recursive. It's very science fiction or fantasy or whatever. The idea is it is not actually recursive. It's not self-referential. You can find this MFT record, number zero, without looking in any MFT record to find it. The boot parameter block tells you where those first 16 MFT records are. Once I can see this MFT record, I can see the data blocks in which the entire MFT is stored. Think of it for now simply in terms of what we've talked about: an MFT record has block numbers listed that tell me where the contents of the file are. Every file has an MFT record, and the block numbers tell me in which data blocks the contents of that file are stored. If the MFT is larger than just the first 16 records, where is the rest of it? The block numbers of the rest of the MFT are stored in the MFT record for the MFT. It's just like any other MFT record. It identifies the block numbers in which the contents of that particular file are stored. It's just like any other file. It seems self-referential, but it's really not. Normally, you would have to read the MFT record to find the block number in which part of the file is stored. But here, I can find the MFT record for the MFT without going through the MFT record itself. I need the MFT record for the MFT primarily only to find the rest of the MFT records. I don't need it to find the beginning of the MFT file itself.
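The way the lookup is grounded, rather than circular, can be sketched as follows. This is a toy model under assumptions, not NTFS's on-disk format: the boot parameter block is a dictionary with a hypothetical `mft_start_block` key, record 0 exposes a hypothetical flat `data_blocks` list, `read_record` is a caller-supplied function, and the 4096-byte block size is chosen for illustration. The 1 KB record size and the 16 contiguous records come from the lecture.

```python
MFT_RECORD_SIZE = 1024   # each MFT record is one kilobyte
FIRST_CONTIGUOUS = 16    # the first 16 records are allocated contiguously
BLOCK_SIZE = 4096        # assumed block size for this sketch


def contiguous_location(n, mft_start_block):
    """Location of one of the first 16 records, using only the start block."""
    logical, within = divmod(n * MFT_RECORD_SIZE, BLOCK_SIZE)
    return mft_start_block + logical, within


def locate_record(n, boot_parameter_block, read_record):
    """Return (physical block, byte offset) of MFT record n."""
    start = boot_parameter_block["mft_start_block"]
    if n < FIRST_CONTIGUOUS:
        # No MFT record is consulted here: the boot parameter block suffices.
        return contiguous_location(n, start)
    # Record 0 describes the MFT file itself and is found without the MFT.
    record0 = read_record(*contiguous_location(0, start))
    logical, within = divmod(n * MFT_RECORD_SIZE, BLOCK_SIZE)
    return record0["data_blocks"][logical], within


# Fake disk: record 0 lives at block 200, offset 0, and lists the MFT's blocks.
disk = {(200, 0): {"data_blocks": [200, 201, 202, 203, 950, 951]}}
bpb = {"mft_start_block": 200}
read = lambda block, offset: disk[(block, offset)]

print(locate_record(5, bpb, read))   # (201, 1024)
print(locate_record(16, bpb, read))  # (950, 0)
```

Record 0 is only needed for records past the first 16, and it is itself reached through the boot parameter block, so the chain of lookups always bottoms out.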
I do not need its MFT record to find the beginning of the MFT, which is where the MFT record for the MFT itself is actually stored. It's hard to say that without it sounding self-referential, but it's not. It's all grounded because I go through the boot parameter block, not the MFT record, to find the MFT record for the MFT. That's the most important part that makes it work. Skipping some of the MFT records that are not very interesting, MFT record number five is the root directory. In Unix, it's the constant two, or in YFS the constant one. We have to know that five is the root directory in every NTFS-format file system. MFT record six happens to be the MFT record that describes the free block bitmap. The free block bitmap is stored someplace on disk. Instead of being a contiguously allocated chunk of disk, it's now stored like a regular file, which means it appears to be contiguous. You treat the bitmap as if it's a contiguous chunk of space, but it's really only virtually contiguous. It's the contents of a file, bit after bit, byte full of bits after byte full of bits, within the contents of the file. By looking at the block numbers in MFT record number six, you now know which block of actual disk space is storing which piece of this file. The virtual contents of the file is the free block bitmap. MFT record number seven: everything is a file. In NTFS, everything is a file. For example, the boot block. In most file systems, the boot block is only accessible by reading and writing the raw hard disk, not going through the file system, because block zero is the boot block. It's not actually part of the file system. But in NTFS, even though the file system doesn't directly manage block number zero, the boot block, it does have a file name. You can get to it by going through a file name that translates to MFT record number seven. MFT record number seven essentially has fixed information in it. 
It says this file is one block big, and the block number for the contents of this file is block number zero. By going through that file name, you can now read and write the contents of the boot block. If you want to update your boot code, you can do that through the boot code file name. That's actually both good and bad. In a classical file system where you can't do that, you, for example, have to be root and read and write the raw hard disk. It's very hard to access the boot code. In NTFS, there are still file protections that will prevent you and me from changing the boot code. But you and I can actually now refer to the file name that is the boot code. Now, the only thing stopping us is the protection on individual files that means you can't change the boot code. A virus that infects your computer would really like to change your boot code, because then every time you boot your machine, the virus gets to run again. Let's see. What's in an MFT record? Since it's the equivalent of an inode, we have to have essentially the kinds of stuff we're used to being in an inode in every MFT record. It's a kilobyte long. What's in an MFT record is a list of attributes. Different attributes have different sizes. Within that MFT record, it's just one attribute, another attribute, another attribute, encoding the information that has to be represented about this particular file. Each MFT record is identified by what we call an MFT reference. An MFT reference contains a 48-bit record number. That means the maximum number of records we can have in an MFT is 2 to the 48. That's a pretty big number. It's maybe a little short-sighted, because why not 64 bits? They do use a 64-bit representation for the MFT reference. The first 48 bits is the record number, which is the sequential number of the MFT record. That's basically the equivalent of an inode number. But then the other 16 bits is what they call a sequence number. It's the same kind of thing as, for example, the reuse count in lab 3. 
We delete a file, reclaim the MFT record, and later create another file with the same record number. The inode number got reused for a different thing than it used to be. Part of the file system, part of a user process, whatever, still has the record number, the equivalent of the inode number. We now have the ability to tell that the thing that now uses that inode number is a different thing from what used to use that inode number. If there is a reuse of the MFT record, the equivalent of the inode, every reuse of it increments the sequence number, so we now know that it's a different use of the same MFT record. The MFT record begins with the record header. The record header contains the current reuse count, so when you reuse the record, this is the field you go to to get the current value. Each MFT record stores its own current sequence number, and this is followed by some variable number of attributes. The attribute control information is inside the MFT record, but then the attribute can refer to data that is also resident inside the MFT record, or data that is non-resident. Within the one-kilobyte MFT record size, there's room for about 700 to 800 bytes of other stuff, besides the stuff that's always there, like the file owner; the technical information has to be there for every file. It turns out the file name is there for every file. Again, this is really a unified equivalent of an inode and a directory entry. What do I mean by non-resident data? For example, the data content of a regular data file can actually be stored inside the MFT record, if the data file is short enough. Depending on what other attributes need to be there, you typically have 700 to 800 bytes of extra space in the one kilobyte. 
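
The 64-bit reference described above (a 48-bit record number plus a sequence number that changes on every reuse) can be sketched in a few lines. This is only an illustration: the function names are made up for this note, and putting the record number in the low 48 bits is an assumption of the sketch rather than something stated in the lecture.

```python
# Illustrative sketch of an MFT-style reference: 48-bit record number
# plus 16-bit sequence number packed into one 64-bit integer.
# Assumption: record number in the low 48 bits, sequence number above it.
RECORD_BITS = 48
SEQUENCE_BITS = 16
RECORD_MASK = (1 << RECORD_BITS) - 1
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


def pack_reference(record_number: int, sequence_number: int) -> int:
    """Combine a record number and a sequence number into one reference."""
    if not 0 <= record_number <= RECORD_MASK:
        raise ValueError("record number does not fit in 48 bits")
    return ((sequence_number & SEQUENCE_MASK) << RECORD_BITS) | record_number


def unpack_reference(reference: int) -> tuple[int, int]:
    """Split a reference back into (record_number, sequence_number)."""
    return reference & RECORD_MASK, (reference >> RECORD_BITS) & SEQUENCE_MASK


def is_stale(reference: int, current_sequence: int) -> bool:
    """True if the record was reused since this reference was handed out."""
    _, sequence = unpack_reference(reference)
    return sequence != current_sequence


# A reference taken when record 5 was on its first use...
old_ref = pack_reference(5, 1)
# ...no longer matches once the record is freed and reused (sequence 2).
print(unpack_reference(old_ref), is_stale(old_ref, current_sequence=2))
```

The point of the extra bits is the last line: a holder of an old reference can detect that the record now describes a different file, without any other bookkeeping.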
If you have a 200-byte data file, the data of that file is going to be entirely inside the MFT record. Most files don't fit into that. The attribute in the MFT record that identifies the data blocks, the contents of the file, may refer to external data blocks, or may say the data is directly here, inside the MFT record itself. The file name is always there, the control information is always there. In the MFT record for the MFT itself, some attributes are the ones that are there for pretty much every file, like the file name is there for every file, the protection is there for every file. Some attributes are special, only there for particular files. There is a bitmap attribute that is there in the MFT record for the MFT. The bitmap attribute in the MFT record for the MFT identifies which MFT records are in use. It is a bitmap of free MFT records within the MFT. Think of a directory in YFS. There are some slots that have directory entries actually using them, and other slots that physically exist but have a zero inode number, so they are available as free slots. In YFS, you don't know which slots are free and which slots are not free. You have to read through them. If you are creating a new name, you have to find an unused slot. Here you can actually look at the bitmap in the bitmap attribute of the MFT record for the MFT itself, and know which MFT records are free, so you can make a new file in that record. So let's see an example of an MFT record. Here is some stuff. I've got the header that is in every record. The header includes, for example, the reuse count. Standard information includes the ownership, the timestamp of when it was last modified, the creation time, and so forth. The file name is oddly not considered part of the standard information. I don't know why, because the file name is required to be there also. It is right after the standard information. The file may have more than one name. 
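
As a rough illustration of why the bitmap attribute mentioned above helps, here is a minimal sketch of finding a free record from a bitmap instead of reading every record. The helper names and the bit order (least significant bit first within each byte) are assumptions made for this example only.

```python
# Sketch: one bit per MFT record, 1 = in use, 0 = free.
# Assumption: bit 0 of byte 0 describes record 0 (LSB-first ordering).
from typing import Optional


def first_free_record(bitmap: bytes) -> Optional[int]:
    """Return the lowest free record number, or None if all are in use."""
    for byte_index, byte in enumerate(bitmap):
        if byte == 0xFF:          # all eight records in this byte are in use
            continue
        for bit in range(8):
            if not (byte >> bit) & 1:
                return byte_index * 8 + bit
    return None


def mark_in_use(bitmap: bytearray, record_number: int) -> None:
    """Set the bit for a record that has just been allocated."""
    bitmap[record_number // 8] |= 1 << (record_number % 8)


# Records 0-10 are in use, record 11 is the first free one.
mft_bitmap = bytearray([0xFF, 0b00000111])
free = first_free_record(mft_bitmap)
mark_in_use(mft_bitmap, free)
print(free, first_free_record(mft_bitmap))
```

Compared with scanning directory slots one by one, the allocator only has to look at the bitmap, which is a small fraction of the size of the records it describes.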
If I have hard links to it. Because they put the directory information, the file name information, inside the equivalent of the inode, making hard links is awkward in Microsoft file systems. In Unix, you just have two directory entries with the same inode number, identifying the same single inode. Here I have one MFT record, which is the equivalent of an inode, but also the equivalent of the directory entry. If I have hard links, I have two name records, two name attributes, in the same MFT record. Once I have those two name attributes there, it becomes like a hard link, in that they are both identifying and labeling the same MFT record, the same file. Similar to what I already talked about in Unix, the maximum length of the name is 255 characters. It's a variable-length name field in a similar way. Then I have other data stuff. That data stuff might be anything. I have unused stuff. The data stuff might be other attributes that I need for that file. It might be the contents of the file itself, if the contents are small enough to fit. If the data is in the MFT record, it just makes that particular attribute bigger than it otherwise would be. If the data does not fit, or the system chooses not to put the data into the MFT record itself, then it is identified as a sequence of extents. We talked about an extent being a contiguous chunk of disk space that has a first block number and a count of blocks at that location. Unlike individual block numbers, the contents of the file are a sequence of extents: where the first chunk of the file is, the next chunk, the next chunk, and so on. How do I actually use those extents? Each is described as a contiguous chunk of space. It looks something like this in the disk space. This is a slightly simplified description of what really goes in there. Each extent is described by three numbers, which they call VCN, LCN, and length. We'll see in a second what VCN and LCN are. VCN is the virtual cluster number. 
This is Microsoft terminology. It's the block number relative to the contents of the file itself. Ignoring where everything actually is physically, this is the first block of the data of the file, the next block, the next block; the VCN says which part of the data of the file is described by this extent. The length is the number of blocks. Say I have five blocks of the file that are contiguous, and these are the first five blocks of the file. So they begin at virtual block number zero of the file. Where, in the physical disk space, are those five contiguous blocks? They are the first five blocks of the file. We know they are contiguous somewhere physically. The LCN is the absolute block number relative to the entire disk, the actual physical block number where those five contiguous blocks begin. They call that the logical cluster number. In Microsoft terminology, cluster is what they use for block. So virtual block number and logical block number are the easier ways to think about these terms, but this is Microsoft terminology. The count, as I said, is the number of contiguous blocks. For example, I might have a file whose extents look like this. It says beginning at block zero of the contents of the file, I have three blocks that are contiguous somewhere, and where they are is beginning at physical block 41. Virtual blocks zero, one, and two are physical blocks 41, 42, and 43. Next is virtual block three: five contiguous blocks, which are blocks 123, 124, 125, 126, and 127 physically. And then virtual block number eight follows after that, and it's nine contiguous blocks beginning at physical block 17. Basically, you just list triple after triple, a set of three numbers after a set of three numbers, one after the other, to list where the contents of the file are stored. 
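
The example extent list above can be turned into a small lookup from virtual block number to physical block number. This is a sketch using the three triples from the lecture; the function name and the tuple layout `(vcn, lcn, length)` are choices made for this note, and the real on-disk encoding is more compact than a list of tuples.

```python
# Sketch: translate a virtual block (cluster) number within a file into
# a physical block number using a list of (vcn, lcn, length) extents.
from typing import Optional

# The three extents from the example: 3 blocks at 41, 5 at 123, 9 at 17.
EXTENTS = [(0, 41, 3), (3, 123, 5), (8, 17, 9)]


def vcn_to_lcn(extents, vcn: int) -> Optional[int]:
    """Return the physical block for a virtual block, or None if unmapped."""
    for start_vcn, start_lcn, length in extents:
        if start_vcn <= vcn < start_vcn + length:
            # Same offset into the extent on the virtual and physical side.
            return start_lcn + (vcn - start_vcn)
    return None


for vcn in (0, 2, 3, 7, 8, 16):
    print(vcn, "->", vcn_to_lcn(EXTENTS, vcn))
```

A linear scan is enough for a handful of extents; for a heavily fragmented file a binary search on the starting virtual block number would be the natural refinement.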
So in the MFT record, the attribute that holds this information would just consist of a bunch of entries in this format. You might ask why. There seems to be a redundancy here. Zero plus three, well, the next one begins at three. Three plus five is the one that begins at eight. Why do I have to keep listing the beginning block number? I can infer that from just reading the list sequentially. The reason is something that occurs in lab two as well, but in a different implementation. This allows you to have a hole in a file. I can actually have a gap in the virtual block numbers: suppose, instead of this being eight, it was 18. So this says, beginning at block three, five contiguous blocks, virtual blocks 3, 4, 5, 6, 7. I can count to five. Blocks 3 through 7 are contiguous beginning somewhere physically. Here it says beginning at physical block 123. If the last extent began at virtual block 18 instead of virtual block eight, it says the next ten blocks of the file simply don't exist. They aren't stored anywhere. If you read any of those ten blocks between block eight and block 17, you get zero contents automatically. Same thing as you do in lab three when you read from the contents of a hole. If you write to that portion of the file, then the file system has to allocate space for it and store what you wrote. But until you write to it, you just seek over it and write and leave a hole in the file. The hole consumes no space, but the file's size still includes it. So how do I use more than one MFT record? MFT records are fixed size. One kilobyte is big, but it's not arbitrarily big. Suppose I have a large number of extents. If, because of the vagaries of allocation, I don't get much contiguous allocation, if every block is independent of every other block and they're located in different places all over the disk, if every extent is of size one, my list of extents could be quite long. 
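
To make the hole case above concrete, here is a small sketch of reading blocks from a file whose last extent starts at virtual block 18 instead of 8. The block size, the dictionary standing in for the disk, and the helper names are all assumptions made for this illustration.

```python
# Sketch: reading from a sparse file described by (vcn, lcn, length) extents.
# Blocks that fall in a gap between extents are not stored anywhere and
# read back as zeros.
BLOCK_SIZE = 16  # tiny block size, chosen only to keep the example readable

# Same example as before, but the third extent begins at virtual block 18,
# leaving virtual blocks 8..17 as a hole.
SPARSE_EXTENTS = [(0, 41, 3), (3, 123, 5), (18, 17, 9)]

# A pretend disk: physical block number -> block contents.
disk = {}
for _, start_lcn, length in SPARSE_EXTENTS:
    for lcn in range(start_lcn, start_lcn + length):
        disk[lcn] = bytes([lcn % 256]) * BLOCK_SIZE


def read_block(extents, disk, vcn: int) -> bytes:
    """Return the contents of one virtual block; holes read as zeros."""
    for start_vcn, start_lcn, length in extents:
        if start_vcn <= vcn < start_vcn + length:
            return disk[start_lcn + (vcn - start_vcn)]
    return bytes(BLOCK_SIZE)


def allocated_blocks(extents) -> int:
    """Blocks that actually consume disk space."""
    return sum(length for _, _, length in extents)


def size_in_blocks(extents) -> int:
    """Logical size of the file, holes included."""
    return max(start_vcn + length for start_vcn, _, length in extents)


print(read_block(SPARSE_EXTENTS, disk, 10))   # inside the hole: all zeros
print(allocated_blocks(SPARSE_EXTENTS), size_in_blocks(SPARSE_EXTENTS))
```

The two totals differ by exactly the ten blocks of the hole, which is why the starting virtual block number has to be stored in each triple rather than inferred.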
I have three integers of every extent, and for the contents of the file, if every extent happens to only be one block big, I'm missing a lot of space representing the only extent that make up the contents of that file. So for that or other reasons, I might have more data to store about the file, not the contents of the file, but the information about the file. It might be bigger than that one kilobyte. In a tenable file, it fits fine. The most common reason you don't fit into a single MFT record is simply because the file is fragmented so much that the list of extents is really, really long. I could have a gigantic file. If I'm lucky enough that it's all one extent, it easily fits into one MFT record. If I take that gigantic file and allocate it differently, it might be a really, really long list of extents because each extent might be very small in each. Every MFT record begins with this header, and in the header is the reference to the... If I have, say, three MFT records, at the beginning of every MFT record for this one file, the beginning of each record is the reference to the MFT record because it would be the first MFT record of that file. So if I have three MFT records, I just haven't stored that much information about the file. The first one we know is the first one. The second one refers to the first one. The third one refers to the first one. Whichever one I'm looking at, I'm always trying to get my way back to the first one. The reference is a 48-bit record number and then the quote I'm reading is killed. In what they call today's MFT record, which is that first MFT record, it's essentially a table of contents. It's an attribute list attribute. It's an attribute that lists all the attributes that are stored somewhere within the MFT or MFTs of this file. Each attribute has a type code. This is what kind of attribute it is. In the attribute list, it says the type code and which MFT record is the reference to the MFT record. 
Among the MFT records, there's been one file. Which MFT record for this file contains that particular attribute for this file? So now if I'm looking at that, I'm going to find a particular attribute. I don't have to read what may be potentially huge numbers of MFT records, hopefully not, but it could be dozens or more MFT records for one file. I don't have to read through all of them. If I'm looking at the table of contents, I know which MFT record has the attribute. I'm looking for those numbers. So it's the attribute type code and the reference to the record that contains that particular attribute. So we now have the ability to represent files that are arbitrarily large during perhaps more than one MFT record in this very fragmented file. The list of extents could be very, very long. By the way, if it's also in the file that has many, many hard links to it, it would have many, many main attributes in it. It's not so common for it to have huge numbers of names, which is common, I won't say common, but certainly not uncommon, whatever, to have files that are fragmented enough that the list of extents exceeds the size of that MFT record. So I put the data outside the MFT record so I'm not consuming that much space in one MFT record. I can put a megabyte file with all inline data, but I don't even consume a large number of MFT records, because I can't put very much data in each MFT record. The data, if it's large, is mutually allocated outside the MFT record, so I'm not consuming certain space inside the record. But again, I have a fragmented file. How do I get a fragmented file to still only use one MFT record, even defragment the file? 
Basically, you re-throw the disk space that belongs to this file, and you essentially start copying block to block, allocating continuous space over here, creating a file block by block, copying the first block, wherever it might be, to the first block of the new contiguous space, the second block, wherever it might be, to the second block of the continuous space. You go through the file block by block, writing to block after block of the contiguous space. When you're all done, you have a new copy of the file that now blocks through the data around. Now it went after the other contiguous. Now, for example, all the blocks change the MFT record to say one extent is 100 or 1,000 extents. If you have a Windows machine, and you're looking at computer management, you can actually manually start a de-fragmentation here, just because you don't know. Windows seems to, these days, come pretty clear. You can automatically do that somehow periodically, but it's actually fun to watch it do it. It'll draw you a little color diagram showing all the blocks and files that are fragmented and files that are not fragmented, and you can see the blocks on the ground and collecting them all together with the unfragmented color on one side and disappearing from the fragmented color where they happen to be before. All right, so the last thing I'll say about NTFS and just sort of being no longer directly related to not pre-allocating dinos, I've sort of tried to paint enough of a complete picture that you can see the system still works even though I'm not pre-allocating dinos. But one extra thing that's also interesting about NTFS is how they represent directories. It's kind of a schizophrenic string system. It has MS-DOS-type directory entries where the name is inside the inode, basically inside the MFT record. That gives a name that goes with this particular file, but it doesn't make a directory. 
It doesn't make it a way you can see someplace in LS and see all the names in the given directory. You know, I have an MFT record. The name in an MFT record is the reference to the MFT record or the directory in which this file was contained and then the name of the file itself. So if I read it, every MFT record, I have the information. I can reconstruct a tree representation of this information, essentially have individual sort of broad links in no particular order, but for each name, I know which directory it's part of. That's not very usable to do things like LS the directory. So they have a separate representation of directories themselves in a way that's similar to UNIX but different from UNIX, similar to YFS, different from YMS. So it's a tree. It logically looks like a tree, and actually two representations of it. One is represented basically like a tree, and the other representation is this weird flat representation where you just have MFT record after MFT record, each with its own name in it. So over here, tree representation. In UNIX, it's an unsorted list of just the name of the directory. Here it's a different format. So unlike UNIX, it's unsorted here. It is sorted. The names in the directory are formed a B-tree. So it's a balanced tree representation of the information of the files in a single directory, and it's sorted by the file name. So I can actually find my way down the B-tree to get to that name reasonably efficiently even in a very, very large directory. In a large UNIX directory or a large YMS directory, the only way to find a new name is for each country because that name is a period. Here I go through the balanced tree and get to the... I get to where in the tree the record for that name should be just the alphabetical structure of the tree. I get to where that name should be if the name exists in the directory. If it exists, I have the directory name and then I'll have the reference to the inode and the directory number. 
If the name doesn't exist, you can go down the tree alphabetically. I get to a place where it should be and I can see that it's not there, so I know the name doesn't exist. Give me a faster look-up. Each effort in the tree still has the same kind of information that a UNIX-style directory has. It has a name and a reference to the MMT for that particular file. Again, the reference is the 40-bit record number, like the item number, and the code in the tree's code. The last thing about directories in NTFS, and also I guess the last thing about NTFS in general, just like the MFT record for the MFT has a bitmap attribute that tells you which MFT records within the MFT are free versus used. Within a directory, the bitmap attribute of the MFT record for the directory tells you the directory of each of the directories. So if you're trying to create a directory of the directory, you have to figure out where to store it and how to index it, link it into the alphabetical entry structure, so it tells you where unused slots are and where you can store the dimensions. That's enough for NTFS. I'm going to do two things. I'm going to start the next topic a little bit, but I'm also going to leave some time because people seem to like to ask questions at the end of class, not at the end of class. I want to make sure you have time to answer your last three questions. All right, so I want to talk about protection security, guys. So I posted this afternoon, actually 17 and 16 in the book, but we're going to transition into that material by talking about protection in files. I won't get to this today, but on Thursday I'll also tell you when you run the lab 123 submit program, also when you run the click date program, how does it actually work? So I'll tell you part of how it works, if you think about it right now, I store your submission someplace. You run the program as yourself, so it stores your submission someplace that you have the ability to write it to. 
But it's stored in the same place that everybody's submitted. So everyone has the ability to write it in the same place your submission is being stored. So it's publicly writeable directly someplace under this. Yet it's totally secure. And to show you how that's secure, I can tell you more about how protection works in Unix systems. So I'm going to actually talk here about how protection works in Unix, how protection works in Windows systems. Unix file protection begins with the PCB as in it, who that process is running on behalf of, user ID is a number, the group ID is a number. When you log in, the group ID and user ID are set in your PCB based on who you log in as. When you fork the trial process, the user ID and the group ID and the parent have two execs until the same process, so that changes the process to exec. Each file has an inode, and the inode is information about who owns that file and the protection of the file. So when the file has user ID and group ID of the owner of the file, generally that is set from when you create a file. You are running your PCB set and you are running on behalf of some particular user, the group ID and group ID. When you create a file, the file you create is owned by the group ID and user ID. We are running it. We created the file. But in the PCB also, the inode also is nine bits for the protection of the file. This looks a whole lot like the protection bits in page table entries. In fact, it has read and write and exude protection bits. So you might imagine read and write almost seem obvious. They are a little bit subtle. It is much more subtle than describing how that actually works. But why do I have nine bits instead of three bits? In page table there are three bits that describe protection of the user and three bits that describe protection internally. I have three sets of three bits, and then two sets of three bits, and then five sets of three bits. 
The reason is different kinds of users get to use a different set of three bits. Three bits go together: read, write, execute, and then read, write, execute, and then read, write, execute. Which of those three sets do you get to use? Say you do some operation on the file, open a file, delete a file, whatever. No, forget delete. That is not a good example. Let me make a clearer example. I want to open a file, and I am running as some user ID. My PCB says I am some person, some user number. The file's inode says the file is owned by some user number. If the user ID that owns the file and the user ID in my PCB match each other, then I use the first three bits and only the first three bits. It does not matter what the last six bits say. I only get to use the first three bits. You might imagine, for example, the first three bits might say the file is readable to me, but not writeable or executable to me. What about the other six bits? The other six bits might say readable, writeable, and executable to everybody else, but to me, it is only readable. You only get to use one of the sets of three bits. You do not get to use all of the bits that somehow apply to you logically. Technically, only one set of three bits applies to you. If the user ID in the PCB matches the user ID in the inode, you get to use the first three bits. If the user IDs do not match but the two group IDs do match, then you get to use the second set of three bits. Users on Unix systems are divided into groups. If you, for example, create a file in the normal way, the typical protection would be that the file is, for example, writeable and readable to the owner of the file, but maybe only readable to other people in my own group, anybody else in my group. I control the file; I can modify the file. My group mates can read my file but not mess up my file. 
And everybody else might have no access to the file at all. Technically, you can have the first set of three bits be more restrictive than the other sets of three bits. If the user IDs don't match and the group IDs don't match, then the last set of three bits controls the access. These are usually referred to as user, group, and other. The first three bits are for when the user IDs match, the middle three bits are for when the group IDs match, and the last three bits are for everyone else. The meanings of read and write are reasonably simple and straightforward. Read means I can see the contents of the file. Write means I can modify the contents of the file. That's true for a regular data file, but it's also true for directories. What does it mean to write on a directory? I don't open a directory in order to write to it, but certain classes of operations do cause the contents of the directory to be modified. Making a file in a directory means I have to make a directory entry in that directory. That means I have to have write permission on the directory. The ability to create a file in a directory is controlled by the ability to write to the directory. The ability to remove a file from a directory is controlled by the ability to write the directory. We talked about this before. To remove a file, I'm using the unlink operation. To remove a file, we need to remove the file name. To remove the file name, we need to modify the directory to no longer have that directory entry in it. We have to have write permission on the directory. Write permission on a file is straightforward, but if you think literally about what it means to create a file in or remove a file name from a directory, that's a write to the directory. Read on a directory means, for example, I can ls the directory; I can see the contents of the directory. Execute permission is a little different. For a regular file, execute is hopefully pretty obvious. Execute means I can pass that file name to the exec kernel call and run that program. 
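
The rule described above, that exactly one of the three sets of bits applies to a process, can be sketched as follows. This is a simplified model for illustration: the function names are made up, and real Unix systems also consider supplementary groups and treat the superuser specially, which the sketch ignores.

```python
# Sketch of the classic Unix nine-bit check: pick ONE set of three bits
# (user, group, or other) and test only that set.
READ, WRITE, EXECUTE = 0o4, 0o2, 0o1


def applicable_bits(mode: int, file_uid: int, file_gid: int,
                    proc_uid: int, proc_gid: int) -> int:
    """Return the single rwx triple that applies to this process."""
    if proc_uid == file_uid:
        return (mode >> 6) & 0o7      # first three bits: owner
    if proc_gid == file_gid:
        return (mode >> 3) & 0o7      # middle three bits: group
    return mode & 0o7                 # last three bits: everyone else


def may_access(mode: int, file_uid: int, file_gid: int,
               proc_uid: int, proc_gid: int, wanted: int) -> bool:
    """True if every requested permission bit is set in the applicable triple."""
    bits = applicable_bits(mode, file_uid, file_gid, proc_uid, proc_gid)
    return bits & wanted == wanted


# Mode 0o477: owner may only read, group and others may do everything.
# The owner still cannot write, because only the first triple applies to them.
print(may_access(0o477, 100, 10, proc_uid=100, proc_gid=10, wanted=WRITE))
print(may_access(0o477, 100, 10, proc_uid=200, proc_gid=10, wanted=WRITE))
```

The first call prints `False` and the second `True`, which is the slightly surprising case from the lecture: the owner can be more restricted than everybody else.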
Execute permission in Unix is totally independent of, not dependent at all on, read permission. To exec a file, in lab 2 you had to actually open the file and read it. But in Unix, internally, it simply bypasses the read permission test on the file when you're exec'ing the file. It can get the contents of the file from disk into memory, even though you don't have read permission on the file. It gets into memory, you can start running the file, but you can't ever see what's in memory. For example, if the file is not readable but is executable, you can run it, but if it crashes, you don't get a core dump. If you got a core dump from an executable but not readable program, you could then look at the core dump to see the contents of the program. Getting it from disk into memory and running it does not expose the contents, it just allows you to run the program. You never can see the contents of the file unless you have read permission on the file. What's execute permission on a directory? It essentially means I can use file names in that directory. Sometimes this is called search permission or traverse permission. If I say ABC slash XYZ, do I have permission to use file names within the ABC directory? What you're really asking is, is the ABC directory executable by me? I may know the file name XYZ exists in the ABC directory. Somehow I learned that file name or knew that was a good file name to try to access. Maybe I got that from reading the directory, but do I have permission to use the file name XYZ within the directory ABC? If the directory ABC is not executable by me, even though I know the file name XYZ within the ABC directory, I can't actually use the file name. So search means I can search the directory to see if that name exists when I try to use the file name. Traverse refers to basically whether I can go through that directory in a path name. File name one slash two slash three: I have to go through the directory two to get to the name three. 
If I don't actually have permission on the directory two, I can't go through the directory two to get to the name three. So basically, can you use name within the directory no matter how you might afford the name? The exchew permission on the directory controls will not use that name relative to that directory. Yes? So if two is the directory, and two is not exchew, but file three is exchew, does that mean you can't exchew file three? Because you have to go through two. You can't get to three without using the file name that goes through directory two. If, by the way, if you have a hardlink to file three that doesn't go through directory two, the inside directory two is a directory entry that has the name of three and an item number for the name three. That's what it means for three to be in the directory two. In fact, somewhere else on the disk I have a hardlink, a second hardlink, to the same inode that is the file three. It doesn't matter what the name is, but it's the same inode. I'm not going through the name of directory two if I use the hardlink on some other side. So I have to be able to get to the file to try to use that file. The other way there is through the directory two I have to be able to use the name three relative to the directory two. And so it's an exchew protection on directory two that would stop me from getting to three that hardlink. Okay. Okay, so let's see. Yeah, so let's see. Read the page from the directory allows you to learn the file names in the directory. I might have other ways to learn the file names. You know, I could write down the names and say, go do this, but independently on your own, you can read the contents of the directory, see the file names, they're all right there. ALS does literally that. But no matter how I do the names, I need the extra permission to use those names. All right, so stop there. I'll answer questions about that three for a while. 
And then on Thursday, I'll tell you how the last section of the section will be.
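
The read versus execute distinction for directories described in this transcript can be sketched in Python. This is a minimal illustration, assuming a POSIX system; the helper name `owner_dir_access` and the `ABC`/`XYZ` names (taken from the lecturer's example) are hypothetical, and only the owner's permission bits are decoded (group, other, ACLs, and the root user's ability to bypass checks are ignored).

```python
import os
import stat
import tempfile


def owner_dir_access(mode: int) -> dict:
    """Decode what the owner's r and x bits mean for a directory mode."""
    return {
        # read bit: may list the directory, i.e. learn the names in it
        "can_list_names": bool(mode & stat.S_IRUSR),
        # execute bit: may search/traverse, i.e. use a name inside it
        "can_use_names": bool(mode & stat.S_IXUSR),
    }


with tempfile.TemporaryDirectory() as root:
    abc = os.path.join(root, "ABC")
    xyz = os.path.join(abc, "XYZ")
    os.mkdir(abc)
    with open(xyz, "w") as handle:
        handle.write("hello")

    os.chmod(abc, 0o300)  # owner: write + execute, but no read
    try:
        access = owner_dir_access(stat.S_IMODE(os.stat(abc).st_mode))

        # Using a known name only needs execute permission on ABC.
        with open(xyz) as handle:
            contents = handle.read()

        # Listing needs read permission; a non-root user is denied here,
        # while root bypasses the check and the listing succeeds.
        try:
            listing = f"listing succeeded: {os.listdir(abc)}"
        except PermissionError:
            listing = "listing denied (no read permission)"
    finally:
        os.chmod(abc, 0o700)  # restore so the temporary tree can be removed

print(access, contents, listing)
```

With mode `0o300` the known name `ABC/XYZ` can still be opened, which is the "search permission" case from the lecture, even though the directory's names cannot be listed.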


### transcript 3: 646 desktop recording

 The apples, if they used 20 for lunch and bought six more, how many apples do they have? You have the language model to delete that output text, right? So this is not a language model that is trained to predict. I mean, this is a language model that's trained to predict the next word, right? So in this case, you're providing some input context. It's not like deleting from the interview. And you're predicting, and you already know what comes next, right? So this is just regular text. How is this trained? It is fine-tuned with the same objective as the original model. It's just a cross-entropy loss on the individual tokens that come later. Another task, I don't know another task. Another task could be, you know, here is the sentence, please reverse the order of the words in the sentence. Then you provide the sentence in the reverse order, right? That could be one of the tasks. And, you know, the problem with this approach is that now, if you have these models, this FLAN-T5, and you put it on a chat window interface, and I say, hey, FLAN-T5, given the sentence, please reverse the order of the sentence. And then the model does it. You might be impressed. You might be saying, wow, the model really understood what I said, but what if you find that that task was in the instruction fine-tuning set? Like, that was one of the tasks the model was explicitly trained for, right? So before you get impressed by the behavior of some of these models, you know, I would say, think whether that task maybe was in the fine-tuning set, right? It's not just a task the model naturally learned by reading text on the internet, right? If T5 was doing something like this, I would be very surprised, of course. But if FLAN-T5 answers correctly one of these questions the model was explicitly trained for, then I'm not as surprised anymore, right? 
That's why, you know, I see, for instance, in this example, the cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 minus 20 equals 3. They bought six more apples, so they have 3 plus 6 equals 9, right? So if you give this type of question to the model, and this is the type of question I see a lot of people are trying on these chatbot systems, and the model replies seemingly intelligently like that, you know, with step-by-step reasoning, it's not like the model just learned that from the data. There was a task that explicitly had input-output pairs like that. The model was taught to reason step-by-step in a very specific way. It didn't just learn what step-by-step means by reading a lot of text. I'm not saying that's impossible, that that couldn't possibly happen, but in a lot of these models that is not what happened. This is one task for which the model was explicitly trained. These are not anymore models that are trained to just predict the next word. In fact, here's an example they gave in, I think, the FLAN-T5 paper or so. Here's the input question. In the following sentences, explain the antecedent of the pronoun, which thing the pronoun refers to, or state that it is ambiguous. The reporter and the chef will discuss their favorite dishes. The options are A, they will discuss the reporter's favorite dishes. B, they will discuss the chef's favorite dishes. C, ambiguous. Answer: let's think step-by-step. The regular T5 model, not FLAN-T5, the regular T5 model output is bad. It's all sentences that make sense. Sentences that have a high probability after those sentences that came before. The reporter and the chef will discuss their favorite dishes. The reporter and the chef will discuss their favorite dishes. But it's not really answering the question. But after they tune it on these many, many tasks, yeah, the model is able to give you the answer. 
It's B, and it goes on A. But this is a model that was already tuned with a lot of question-and-answer pairs, after it was pre-trained with next-word prediction. And yeah, FLAN-T5 and T5, those are models that Google actually released; you can download them. I've already used them. Facebook also has their own version called OPT. And I don't remember what the O stands for. This one, OPT. I mean, GPT stands, I think, for Generative Pre-trained Transformer. And OPT stands for O-something Pre-trained Transformer. I don't even keep track of the answer anymore. I just know the one from Facebook is OPT. Right? But they also have their own version of OPT that is tuned on instructions. And this is called OPT-IML. I don't know that. I don't think Google and Facebook have as good marketing as OpenAI, or as catchy names for their models. But yes, so Facebook has this OPT-IML. And I don't remember what IML stands for either. I think it's modeling, instruction, modeling language. I'm guessing. But I bet that the I must be for instruction. And so you can see here, again, some of the input-output texts that are used for tuning the model. And here are some of the input-output texts that are used to evaluate the model. And some of these papers, I mean, you should read them in detail, especially the experimental section. And especially as you're working on your project report, you know, some of the links I'm putting here in the slides, I hope you reflect on them, because they might give you ideas for what experiments to run or how to present your experiments in your project report. But here they do have an experiment where they are trying to see how well the model does on tasks that were not seen during training. Right? How well the model generalizes from some types of task to other types of task. Yeah, so this is what people are doing now. 
So large language model, but now we're gonna fine-tune the model with text that is explicit input-output pairs for various tasks. Anything that you can turn in an input-output pair format, you can use it with this model. And people have already thought about two thousand tasks. You know, I would say, you know, think about a dozen tasks when you go home. And then ponder about how many tasks you would like two thousand tasks. Like I said, you can by yourself enumerate two thousand tasks. Let alone collect data for two thousand tasks. Right? Question answering, like text classification for movies, movie review story, product review classification, question answering for second grade questions, fill in the blank fill in the blank type of question, multiple choice questions about physics, about chemistry, like all of that, you know, and the model is trained with all these tasks, all together. Alright, so the last type of model, and this is what the Calc models like chat GPT become even even better than just putting on instructions. And this is a topic that really goes very outside this class. It just touches on reinforcement learning. And we haven't talked about reinforcement learning. There is a class for reinforcement learning here at Price PM on my professor and why that. So I recommend you take that class. I'm going to give you just here the average version of how and why do we need reinforcement learning all the time. Yeah, but here we are going to also collect a lot of human input and output pairs, like input questions, output answers. Right? This is what these channels do anyway. Right? You go there and say hey can you tell me what is the distance compared to the moon or can you tell me what are the you know, the free countries by oil production or something, you know. And he makes you a list. Right? He makes you a list. So you can think of all these tasks and collect the data and then finding the model with that. And that's what they are doing too. 
But on top of that they are also collecting many answers from many people. So for instance here in step one they say, collect demonstration data and train a supervised policy. What's happening here? Like, for instance, explain the moon landing to a 6 year old, and then you have people writing answers to that question, many people potentially. So once you have that, you can have the same question with many different answers that were collected from many different people. And some of the answers that people give you in the collection will be better than others. Now you have a separate set of people looking over the answers and providing you rankings and telling you, you know, I like this answer better than that one, this is a better answer than that. So you can imagine how expensive it is to pay people for this as well. So then you are going to use that to also train a model that tells you how good an answer looks for a given question. And then you're going to optimize the language model together so that it not only produces good answers, but the answers also get a high reward. Now let me do this step by step. So you're going to have your language model and you're going to tune your language model so that even some data that the model generates, you're going to train these models to give you a reward for it. So basically, if the model generates something that is aligned with what a human provided, as a score, so basically you're just training a regression model. You're training the model to predict the score that a human would give to some answer. And so here's the thing. Once you have that model, this is the model that computes the reward for a text. This is the model that you train on text, and that model can also be based on your pre-trained language model. So this model, you can use it by itself; you know, you give it a piece of text and it tells you how good that text is. 
And so the idea is that you are going to update the model so that it produces outputs that get a higher and higher reward under the reward function. And this is something we could normally train. We would think, you know, why don't I update the model parameters end-to-end. So this is one of those cases where you cannot use stochastic gradient descent to train the model end-to-end. Does anybody have an idea why not? You know, this model produces some text and then this model gives me a score for that text, and I want that score to be as high as possible. Why can I not train it that way? Because the reward function, you cannot do the derivative of that? I mean, for the reward function you can compute the derivative. I mean, this is a neural network, you can see the drawing here. The problem is the derivative, you're right, there's a problem with the derivative, but it's not in the reward function itself. But here, in this example, you know, this arrow that says text, which goes from the language model to the reward, that's the part where you cannot backpropagate, right? The output of your language model, think about what is the output of your language model. Is the output of a language model text? It's not text. The language model doesn't really output text. It outputs a probability distribution over tokens, right? And you had experience with this in your assignments, at least your last one, where I asked you to sample from the language model, I think you were experimenting with a recurrent neural network. It's the same thing. The recurrent neural network gives you the probability scores for the next word. And then you have to pick the word from that distribution, right? You can take the max, right? But taking the max is non-differentiable, right? Like, the model output gives you the probability scores for the next time step, and you have to pick a word. And you can only pick one word. So that's this topic, right? 
You're making discrete choices about which word, which token to choose. It's not like an image, you know? When we have an image, like in the GAN model, and this is also the reason why there are no good GAN models for text. How do you generate text with a GAN? So if you have a generator network, G, for an image, you pass a noise vector to G, and it generates your image, right? And so here you have a discriminator network, D. It predicts a yes or no score. It's very similar to the generator and discriminator you have in a GAN, what I'm showing you there. So instead of that, you have your language model, which is your generator, it says there trained LM, that would be your generator. In that case, it's generating text, but in the GAN, it generates an image. And there you have the other one that says reward, preference, model. That's your discriminator, right? Reward, preference, model. Right? So this one is generating images, and this one is telling you how good the image is. That one is generating text, and the reward model tells you how good that text is. So in a way, it's very similar to the GAN model. However, yeah, we're good. However, we need reinforcement learning there, and we don't need reinforcement learning here. Because even though it looks very similar, it's not the same. Because the outputs here are pixel values, right? Which are between zero and one, let's say. And the inputs to the discriminator are also pixel values, which are between zero and one. And these are continuous values, they're not discrete, right? It's between zero and one. So you can backpropagate through this. And treat it as if it was a single neural network in some way. That we cannot do there. Because here, the generator, which is a language model, so the language model outputs here some probability distribution over tokens, right? You know, here's your vocabulary, and it outputs, you know, scores here. And then you pick the one with the largest score, right? 
For instance, cat, or dog, or whatever word. And that word cat goes back into the language model so that the language model can predict the distribution over the next word. And then here you pick maybe play, cat play, right? So you pick the word play. So now that you have the word play, you send it to the language model, and the language model predicts the next set of scores. Cat playing maybe in, I don't know, or inside the house. Inside. And then you take the word inside, and you send it to the language model, and at the end of the decoding process, you have the sentence, you know, cat play inside. Cat play inside. Right there. And that text is what you are sending to your reward model. You're not sending the output. These vectors which are the output, you're not sending. Also, these vectors came from the output. Also, you needed to use this output here, which you are feeding back as the input here. So this decoding process from the language model is non-differentiable. We're making choices at every time step. So even though my reward model can tell me whether cat playing inside is a good sentence or not, I cannot backpropagate that reward here. Do you know another problem like this where reinforcement learning is usually used, for people who have some experience with reinforcement learning? Why do people use reinforcement learning? There's a trick used in ChatGPT called reinforcement learning from human feedback. Yes, that's what we're discussing here. This is reinforcement learning from human feedback. But reinforcement learning was not popular, at least before it was not very widely used with NLP. What is the application that most people use reinforcement learning for? Self-driving cars. Self-driving cars? What else? Decision-making tasks. Decision-making tasks can be an example. I'm asking for an example. Like chess. Games. Why games? Because I wanted to convince you that games are like this in a way. 
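
The point in the transcript that picking a word is a discrete, non-differentiable step can be illustrated with a small Python sketch. The vocabulary and the logit values are hypothetical toy numbers, not taken from any real model. The sketch shows that nudging a logit changes the probabilities smoothly, while the picked token stays the same, so a gradient taken through the pick would be zero almost everywhere.

```python
import math

VOCAB = ["cat", "dog", "play", "inside"]  # hypothetical toy vocabulary


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Return the index of the highest-probability token (the 'max' choice)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


logits = [2.0, 1.0, 0.5, 0.1]
nudged = [logits[0] + 0.01] + logits[1:]  # small change to one score

# The continuous output responds to the nudge...
prob_before = softmax(logits)[0]
prob_after = softmax(nudged)[0]

# ...but the discrete choice that is fed back into the model does not.
pick_before = VOCAB[greedy_pick(logits)]
pick_after = VOCAB[greedy_pick(nudged)]

print(prob_before, prob_after, pick_before, pick_after)
```

Because the reward model only sees the picked tokens, a reward computed on them carries no gradient back to the scores, which is why the lecture turns to reinforcement learning at this point.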
In a game you're also making discrete choices in the game. And every time steps in a game, you're like let's say you're driving like Mario Kart you're turning left, you're turning right you're shooting someone that's the thing. You're jumping something those are discrete choices. So usually you have a neural network that this is what they call here the policy network. Right? So that's why it says here where is it? Supervised policy. Right? So you have the policy network which in this case is the train language model or like the generator. So the policy network tells you which is the action I should take. Should I turn left? Should I turn right? Should I jump? Should I press the shoot button in the game? Right? And the policy network gives you a probability distribution action. But you have to pick one. Right? You have to pick one and depending on which one you pick depends on what you're gonna do next. And also depending on which one you pick the state of the game changes in the next step. Right? So something is happening here like this. Depending on which word you hit you take cap, the max that will change which word will show up next. And usually when you're decoding a language model you're not always picking the max. In your assignment I asked you to pick the max but sometimes you have to pick you know like samples from the distribution. Right? So you have here a larger score you're more likely to pick that one but you're not picking it necessarily. And so when you're decoding a sentence it's also like you're making discrete decisions at every time. In a game you're making discrete decisions. What is the reward of the game? Usually in video games or any games you also can compute a reward function. Right? Like you're playing Mario Kart my character fell into a pit or something. Or did I lose the race at the end? But you don't get the reward until after several mistakes that you make. 
So something like that, that's why, you know, reinforcement learning is used when you have this type of problem where you're making discrete choices and you cannot just backpropagate gradients or something like that. Another area that I wanted people to say besides video games. What do you think that is? Did somebody think about it? The other one is robotics. Robotics. And you know Professor Viva is an extreme robotic. Reinforcement learning. But robotics also. Because in robotics, you know, maybe you're trying to chase a target or find a way out of a maze. You're also turning left and right, making discrete decisions. And when you find the target then you have maximum reward. Or as you get closer to the target you get maximum reward. Right? But your robot might not know like the whole closing from the target. But every time it does. So there's also problems there. But as long as you can evaluate any position and know which position is better than the other, you can use reinforcement learning to adjust the weights of your policy network so that your policy network gives you better actions in the current state of the game. Or the current state of the board. Or the current state of the robot position in the maze or whatever it is. But not only that, you know, it's also in robotics when you're moving a robotic arm. It's also discrete decisions about how you're going to move each of the actuators in the arm to achieve certain desired positions. And then you have a reward. Like if the robot sees an object or didn't see an object then you get a positive or negative reward. The closer it is to the object the better the reward. And that's why a lot of these robotic reinforcement learning algorithms are trained on simulators. So then you can simulate the action anytime you want to make computer work for a computer model like that. Yeah, but if this was not a problem the problem would be very similar to a GAN. And now reinforcement learning, ways to train this with reinforcement learning. 
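
The idea of adjusting a policy network so that sampled discrete actions earn more reward can be sketched with plain REINFORCE on a toy problem. This is an illustration under stated assumptions, not the algorithm used in the paper discussed in the lecture: the "policy" is just one logit per action, the reward values are hypothetical, and the function name `reinforce_bandit` is made up for this example.

```python
import math
import random


def softmax(logits):
    """Turn action scores into a probability distribution over actions."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.5, seed=0):
    """Train a softmax policy with the REINFORCE (score-function) update.

    rewards[i] is the hypothetical reward for taking discrete action i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Discrete, non-differentiable step: sample one action.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # d log pi(action) / d logit_i = 1[i == action] - probs[i]
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


# Only the third action (say, "jump") is rewarded in this toy setup.
final_probs = reinforce_bandit([0.0, 0.0, 1.0])
print(final_probs)
```

The update never differentiates through the sampled action itself; it only needs the gradient of the log-probability of the action that was taken, weighted by the reward that came back.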
There are many ways. So PPO is one of the algorithms for training. And it's the one that people use in this paper called InstructGPT. InstructGPT is a paper published by OpenAI, which people widely believe is at least the way ChatGPT was eventually trained also. Because they haven't released many details about how ChatGPT was trained, you know, it's not like FLAN-T5, where we know it was tuned on about 2000 tasks; for the GPT model we don't know how many tasks. They clearly seem to be recruiting people for writing code. So they're clearly collecting code writing and coding problems. So unsurprisingly those models are good at coding. But yeah, there is another problem here though, and it is why this doesn't look exactly like the generator and the discriminator. So I've been ignoring it, but you see they also have these other probabilities. But the answer is here. We already talked about the K-L loss, it has come up a couple of times. The K-L divergence, you see, is just measuring how different two probability distributions are. So basically this model is being trained with the reinforcement objective from the reward model. So the parameters here are changing, but here you have a copy of the original model, and this one you're not going to modify. So basically the probabilities you see here, the probabilities of the tokens generated here and the probabilities of the tokens generated by the original, you try to make them consistent. The reason for that is because you don't want this model to be generating just anything, just so that it gets a high score from the reward model. It's kind of a hack and, you know, it's one of the problems of reinforcement learning. I think they call it reward hacking. It might produce garbage text that gets a high reward, but it doesn't look any more like actual text that you can understand. You get the high reward and there's a reward function kind of adaptation. 
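
The K-L term described above, which keeps the tuned model's token probabilities close to those of the frozen original copy, can be sketched as follows. This is a minimal illustration under stated assumptions: the distributions are hypothetical three-token examples, the penalty weight `beta` and the helper names are made up here, and the direction KL(policy || reference) is one common choice rather than something confirmed by the transcript.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


reference = [0.5, 0.3, 0.2]    # frozen copy of the original language model
close = [0.45, 0.35, 0.2]      # tuned model that stayed near the original
drifted = [0.98, 0.01, 0.01]   # tuned model that collapsed onto one token

kl_close = kl_divergence(close, reference)
kl_drifted = kl_divergence(drifted, reference)

# With the same raw reward, the drifted policy ends up with a lower objective.
score_close = penalized_reward(1.0, close, reference)
score_drifted = penalized_reward(1.0, drifted, reference)
print(kl_close, kl_drifted, score_close, score_drifted)
```

Unlike the reward computed on sampled text, this term is computed directly on the output probabilities, so it can be differentiated with respect to the tuned model's parameters.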
So to avoid that they have the K-L divergence loss, which is fully differentiable because it's comparing the output probabilities directly. So then you can do this. So it's basically modifying the language model, but not so much that it produces garbage. That's what it's doing. And yeah, if you're interested in a longer summary of these I need the logic from last semester's first class half the summary like this. Maybe a little bit more detailed, but you will see there that one of the points of emphasis in that slide is that this is a very rapidly evolving area of work and whatever is being said might look very different next year. But this is more or less what happens with the model. Any questions so far? Yeah, so you're asking for the good structure to be the reward in the learning process that you talked about. Is that at all meaningful or possible to be managed in place? So from what I understand open-ended writing is the right word possibly if I replace that with a large public model and just do the job for me. Yes, absolutely. That's what people have done. And in fact I think there's a group of students at Stanford, I think in a class like this, who did it as a class project. They took outputs from ChatGPT and they tuned one of the models that are open source with those outputs from ChatGPT, and they quickly could get a model that looked like it could behave like ChatGPT. And they had to take it down because then OpenAI was suing them. Because they're basically scooping out all the knowledge of ChatGPT and maybe they potentially paid for it. But yeah, I don't think it's good. Yes, you can do that. At the end of the day, somebody has to do the job of collecting the data anyway. You can try to extract that much information from the lab that was all in the model. But why sue? That's model distillation, right? You are converting the model that's recitarian Why sue? Yeah, I don't know. 
Because potentially they invested lots and lots of money on paying labelers to label that data. Yes. Well, the thing is that a lot of people feel ambivalent about this from an ethical perspective because they are also sizzling away a lot of public data from the internet to make it work at that level. I don't know how I feel personally. I wouldn't be mad at those students at all. Any other questions? Alright, so the next step is, well, I'm not saying it in the slides, but this is like into the world of GPC4. And GPC4 was released as we were taking these classes, right? And there are even less details about how we want to go forward. There's not a lot of details about how we want to go forward. We know even less. But there are efforts in the computer vision NLP community to do something like that. They're happy, right? Aside from like, for instance, we talked about some way of combining language models with images. So that your language models your GPC4 or your T5 model or your class T5 model they only take text as input and predict text as output. But can we make them take images also as input? That's the question right here. And it turns out that there is a way that it's not that hard. You don't need a whole lot of resources to make it work. And so basically this is the this is the idea. So a language model a language model takes input text and produces output text. So here is a language model and each of these right here are the input text embeds. Right? Which you know you have to take your text usually you tokenize your text once you have tokens for your text you convert each of your tokens into a number and you can embed a vector and that embedding vector goes into your language model. So after all you have to convert a number and then that number into a vector and then that vector that sequence of vectors is made. So a language model really just takes an input a sequence of input vectors that have continuous classes. 
So what they propose in this word is that they're going to train a small well it doesn't have to be small but you can train an encoder a visual encoder model. Maybe a PNN maybe a visual transformer but the output of these will be two embedding vectors of the same size as the embedding vectors the language model takes input. So in your assignment number one you have you have to fine tune a PNN to classify twenty each category of language or each. So you modify the last layer of the PNN so that instead of outputting a thousand output scores you output twenty output scores. So you change the last layer of the model so that you can fine tune you could also change the last layer of the model so that instead of producing one vector of size twenty it produces two vectors of size hundred and twenty eight. Yeah, so you could add some two linear layers that output two vectors of the same size as the vector that the language model takes. The problem with that is that if you just do that the language model will not understand what these vectors are. The language model will not train with images. Just because you are sending images as input, the model will not magically. So you have to train the model with these new type of images. And so this is what they suggest in the paper. We are going to take a language model that was already trained with text models. And we are going to tune it with a data set that has images with text images. Like this one right here. So you can see here an example. You can see the image of the boat and the description says all red boats on the water. You are going to be training these models with these type of images. But here is the key thing. You know when you are finding a model you are destroying information in the model. That is another problem. This is a model that is a very large type of model that is trying to take to complete text at a very high quality. And by cleaning it on a small data set you are destroying that model. 
So the solution they found here, and this is the key idea in this work is they keep the language model layer frozen. So you can see here it says frozen. And here the embedding layer of the language model is also frozen. Meaning like you are not going to modify the language model. The only thing you are going to modify is the visual encoder. Maybe just the last linear layer that will use these two vectors. That is all you are going to modify. You might have to do these projects in plain English. What you are going to be doing is you are going to trace a layer here. This layer is red. It takes the output of that image and it translates that output into the type of input this model expects. So basically you are going to take the output of this model and you are going to convert it into text from any type of output. So that the model can make use of it. Without disturbing what the model already knows. Because these models already do that. Basically you are just modifying the input here. So that it generates still a good output. So you are still going to have a classification loss here. And you are going to be doing XGB to train the parameters. But you are going to skip any of the parameters in the Laplace model layers. You are going to be updating the linear layer of the CNN or maybe the entire CNN. But you are not going to be updating the Laplace model. So if you send text input to this Laplace model you will still be able to do the exact same quality because you haven't touched it. And you can do that. So once you have trained the model like this you can use it in all these other ways. And this is what they try. For regional question answers you can now send the EMMA to the digital encoder and the question to the text encoder and then it will give you the answers. Yeah. So for training does it turn all the water avoided as a filter? 
Oh, during training, what is shown here is during training in one step of training like in one batch of training you are going to be masking these parts. So this is an ultra regressive model like GPT-2 so this is not provided on the water. So you just provide a small red boat and you are trying to predict just on the water. So that is why it has the red arrow here. That red arrow is saying like back up against the small pool. So when the red water is creating this data set then it has to do with the data set No, no, no, no. The data set has the EMMA with the pool sample. And during training you are the one who are like not only like masking some of these colors. So next time you see the training sample maybe you will mask up to here. Maybe next time you will mask the whole thing. Because you have the EMMA it should be able to do that. Yes. One question is why do we use two basically what is that called? Why do we use two tokens instead of just one? That happens to be the right amount of parameters to compress. Yes, and they have an experiment where they tried to do just one and just one and then three and they found that two was a good concept. You cannot pack and I'm actually surprised you can pack most images information in just two tokens. But you know these two tokens are high dimensional. There is a lot of information to pack. And remember this is a language model that is already trained to answer questions. Not about images but general questions. Like if you ask the model what color is the cat you will be able to answer some colors because you already know that if you're asking what color is something the answer can only be like from a very finite set of possible words corresponding to colors. But you need an image to really know which one is the correct color in this case. But it says what color is the car Now the model understands not just the text. The model was trained just with images with captions. Not with images as questions. 
But now the model is able to extract image information because it is trained with images. And you can also do the few-shot learning like the few-shot prompt into the model. You can send an image maybe to the doc an image is the ticket an image what is this and it says you can say that I think that that is not a real object it is made of words it is just trying to show that it can recognize what that is or maybe that or maybe that or maybe that Or maybe that maybe I am trying to I think that I think that I think that Google has also sorry, if you mind, you leave also a new one called flamingo you can also do cross engineering like this you send a schema you send an email you send an email you send an email you can read the text in the email yeah, so you can also do cross engineering yeah, so you know, like before I also think that input images it is probably doing something like this not like some adaptation of the original one anyway, yeah we are taking our time so next class I am going to be giving you an in class excuse me, so I will I will recommend you will just stay on I am going to connect and we will meet next time I will be in town


# Backend
(refer to https://github.com/Sentdex/ChatGPT-API-Basics/blob/main/chatGPTAPIbasics.ipynb)

In [None]:
transcript = data['text']

In [None]:
words = transcript.split()
num_words = len(words)
print("Number of words:", num_words)

In [None]:
import os
import openai

# Load the key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
"""
A helpful rule of thumb is that one token generally corresponds to 
~4 characters of text for common English text. This translates to roughly 3/4
of a word (so 100 tokens ~= 75 words).
"""
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tokenize the transcript
encoded_input = tokenizer.encode(transcript, return_tensors="pt")

# Decode the tokens back to text
decoded_output = tokenizer.decode(encoded_input[0])

print(encoded_input)
print(decoded_output)

In [None]:
print(f"The number of tokens is equal to {encoded_input.shape[1]}.")
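
The rule of thumb quoted above (about 4 characters, or roughly 3/4 of a word, per token) can be turned into a quick estimate that needs no tokenizer download. This is a minimal sketch: `estimate_tokens` is a hypothetical helper, not part of `transformers` or `openai`, and its output is only an approximation of the exact count reported by `GPT2Tokenizer`.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the ~4 chars/token and ~0.75 words/token heuristics."""
    num_chars = len(text)
    num_words = len(text.split())
    return {
        "by_chars": round(num_chars / 4),     # ~4 characters per token
        "by_words": round(num_words / 0.75),  # ~0.75 words per token
    }


sample = "Hume thinks that rationality is not just a matter of not contradicting yourself."
print(estimate_tokens(sample))
```

The two estimates usually differ a little; comparing either against `encoded_input.shape[1]` shows how far the heuristic is off for a given transcript.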

In [None]:
prompt = """I will give you a lecture transcript. I want you to convert this transcript into a student's class note in the form of latex. The goal of the note is that it is structurally easier to read and understand. Additional requirement is that it preserves the complete information by drawing reference to specific section of the transcript, logically and chronologically progressive, well organized and structured. Here goes the lecture transcript: """

transcript_sec_1 = ' '.join(transcript.split()[:1008]) # feed only the first 1008 words

In [None]:
prompt = """I will give you a lecture transcript. I want you to convert this transcript into a student's class note in the form of latex. The goal of the note is that it is structurally easier to read and understand. Additional requirement is that it preserves the complete information by drawing reference to specific section of the transcript, logically and chronologically progressive, well organized and structured. Are you ready?"""

transcript_sec_2 = ' '.join(transcript.split()[:1008]) # feed only the first 1008 words
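
The cells above feed only the first 1008 words of a transcript to the model. To cover a whole lecture, the transcript could be split into consecutive sections of the same size. A minimal sketch, assuming plain whitespace splitting is acceptable; `split_into_word_chunks` is a hypothetical helper and the 1008-word default simply mirrors the value used above.

```python
from typing import List


def split_into_word_chunks(text: str, max_words: int = 1008) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


demo = "one two three four five"
for idx, section in enumerate(split_into_word_chunks(demo, max_words=2), start=1):
    print(f"section {idx}: {section}")
```

Each returned section can then be sent with the same `prompt`, one request per section.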

### transcript_sec_1

Coherent set of attitudes. I haven't contradicted myself. But rationality is not a matter of not contradicting yourself. There are lots of ways to not contradict yourself that are plainly irrational. And so the mere fact that this is a coherent set of attitudes is not enough to show that we're being rational. Here's another way you could do it. Think about a T2 star, which I just need to be a variant on T2. So Sam comes in, Sam says that it's snowing, so I become highly confident that Sam said that it's snowing. Suppose I want to, I still think, suppose I'm a weather channel junkie and I'm very confident that it doesn't snow in Houston. Another way I could maintain my coherence is to give up on my confidence in Sam's veracity. I could think, well, Sam said it's snowing, but it's definitely not snowing. So I guess Sam is not as reliable as I thought he was. That's another way to maintain coherence. Now, Hume's idea, though, is that this is a bit different. We shouldn't do either of these things. Hume's idea is that if I'm highly confident in this and then if I started out highly confident in this, I'm sorry, if I started out no attitude here, highly confident that this is true, highly confident this is false, and then I become highly confident this is true, Hume thinks that there's a kind of, as we'll see on each of the quote in a bit here, there's a kind of mutual destruction between my confidence in the truth of two and my confidence in the falsehood of three. And so what should happen is I should become less confident in two. Instead of being highly confident, I should be, I don't know, moderately confident. And I should become much more confident in three. So instead of being very confident that it's false, I should be having a moderate degree of confidence. And so Hume's idea here is that once I become highly confident that one is true, I should become medium confident in both two and three. OK. Any questions about the proposal before we leave this yet? 
I'm just a little bit confused, like, logic-wise, how we have, like, half-competences on something. Yeah. Good. So if you've taken a logic class, then you'll recognize that this is a pattern that you're not going to be able to see. And you're not going to be able to see the pattern that you're going to be able to see. And then you'll recognize that this is a pattern. So this is the basic puzzle, is that we have three sentences, three propositions, and they can't all be true. This is an inconsistent trial. And so the truth of any two of the sentences guarantees the falsity of the other. And the situation we're imagining is where you're highly confident that one is true, no attitude towards two and three, and then you become highly confident that one is true as well. And now that requires you to adopt a high confidence in three. OK, now if you're used to thinking in terms of logic, in terms of the way you study these things in a logic class, you don't think about these sentences in terms of truth or in terms of how confident you can be that they're true. You think about them in terms of whether they are true or false. Because in logic, logic is what we call an alethic enterprise. It is where we trace out the relationship between the truth of one sentence and the truth of another sentence. What humans do in here is not logic. What humans do in here is epistemology. And epistemology is different from logic in that we're not talking directly about what is true and what is false. We're talking about what attitudes it is appropriate to adopt or rational to adopt. What should I believe? And how confident should I be in that belief? And that does not track logic cleanly. Strangely enough, logic and epistemology are quite a distinct enterprise. There are many logical truths, many relations between the two values of sentences, which are such that it would be irrational for you to believe it. 
And there are many things that are rational for you to believe, which nonetheless fail to track logic and these sort of things. So the question we're asking ourselves is, what should I believe? And for Hume, as we'll see, the question's really, what should I believe given my evidence? And that's very different from the question of, what does logic guarantee to be true? So there's another bit of a suppressed point in here. So I'm talking about, what should I believe? And then I go back and I start talking about levels of confidence. What's going on there? Well, Hume is adopting a specific view about what a belief is. We often talk in this kind of inconsistent ways about belief. Sometimes we talk about, I believe this and I don't believe that, where belief is a kind of categorical attitude. Either I believe it or I don't. But sometimes we talk about belief as a kind of commitment towards the truth of a proposition, where that commitment comes in degrees. So you can be more or less confident in some proposition. I am extremely confident that Boston is the capital of Texas. I am pretty confident, confident that Montpelier is the capital of Vermont. I am not at all confident about, is Tulsa the capital of Oklahoma? I don't even know. Oklahoma, like, never been talked about in Oklahoma. I have no idea what the capital of Oklahoma is, right? So I have these different attitudes towards these different propositions about state capitals. So the idea here is that this is at least some indication that belief is not an all or nothing matter. It's not categorical in the sense that for a sentence, either I believe it or I don't.

### transcript_sec_2

We'll be honest, we've already done it, but I'm interested in hearing more from you. So, here are my comments when we start talking about this. On Saturday, the idea that we're going to have to update the directory to secure the entry by entry. After you update the data, it's forgetting to be block by block. But once you have a getting block, when you input the directory to block, it's going to assume that they're entering by entry. At the end, they draw the same length, originally. Now we're going to allocate the space, and within the block, I'm going to have to calculate the space, what I call jump by jump. Jump by jump to jumps are bounded in both ways. That sign was chosen so that no matter what the actual sacred size is, no matter what the actual block size is, no shown camera stands across the sacred boundary or across the block boundary. What that means is you can always read the directory jumps in a single disk operation on the right of the directory jumps in a single disk operation. That means you're going to have a lot less of a problem with the potential of, for example, writing a jump that might stand across the block or a sacred boundary. If it does stand across the boundary, you're going to be writing a block right here, so you're going to have consistent crashes after writing the first half of the jump before writing the second half of the jump. Now, these jumps are written entirely on one operator, and they're expanded across the block and across the sacred boundary. Within the jump, the reality of the space is in very precise pieces. So these familiar things are a mild name. Here they add the null termination operator, a great reason to get exposed for convenience, and they allocated the item in there. But there's two other fields we talked previously about. On Thursday, we have the length of the name. 
The length of the length of the null termination basically tells you how large this field is or how many characters in this field are important or relevant. But the record length field tells you the length of the entire table. The record length field is independent of the name length field because the record may be longer than each city that represents just that name. If the whole size of the entire record would fall in the size of the name plus the size of the item number plus the size of the name length field, there would be no need for a record length field because the record length would be applied to the size of the other pieces. But the record length field being independent allows the entire record to be larger than it used to be. And the reason for that is the 512-byte chunks are always exactly full. Suppose you have your name that don't exactly go up the 512-byte and the last record in the chunk, the length of that record is longer than it used to be. Basically, that last record is as small as up the remaining space within the chunk to fill out the rest of that chunk again. When you put a directory entry in the class divergent, no matter what, you can change the item number to zero and that directory entry is effectively considered to be going to be technically still there, but being bored when you read over it is still catching the price of the size of the directory, but it does not determine the actual directory entry. And most of the time here we don't do that. Instead, because the name length field is not going to be perfectly independent, and you delete a directory entry, you simply take the directory entry and report it in the same chunk. If the chunk consists of a bunch of directories, you delete the directory entry out of the middle of that chunk, and you're created for that to be the same chunk. It's larger. You can increase the size of the small of the space that the directory you just deleted used to consume. 
Now again, the chunk is now 100% full. There's no entity space in the chunk that the directory you are going to use to be. You can't do that, though, if you delete the first directory entry of the chunk, because just to make the space that was left by the first directory entry were consumed into the previous directory entry. Now we're just going to be crossing the chunk down. We want to do that. As I said, for the first directory entry chunk, they still change the item number to zero. Let me show you an example of that. Here's a simple chunk that has three directory entries in it. Finally, it takes you to test us in, and we're going to make one, three, four. So the size of all these fields, and I'm going to show the exact number of examples here, the name length and the record on fields are twice each. That means that you could have a directory entry with the name of that directory entry being basically up to 65,000 characters long, and don't actually use it for that. The actual limitation is 255 characters. But I guess they sign in. Some of the signed decisions on the document have the exact reason for some of these decisions. Important ones, they document the less important ones, they just hit it. So I assume that the format that shows them could, inside some days, make the names being too long, 255 characters. But really, 255-character file names is pretty long enough. You want to really type, you know, CD to primary character name. Probably not. But the representation of the names is up to two to the 16th. So the two and two bytes, right? Two bytes each. The item number used to be two bytes. In my case, it's two bytes. The class name uses two bytes. Here's the item number. These four bytes just allow for more individual files to be part of a really large file system. These files have a unique item number. There's a name, download terminated, and the name length field. It might include that in the byte. So if you look at what we have up here... All right. 
We have ABC... Oh, that's four bytes. Three is the length of the name, not included, also just the ABC part. Seventeen is just the item number. That should be the range number in my example. And what would the correct one be? Well, it's these four. That's four plus these two is six plus these two is eight. Plus... What do I have right here? Two, four... Two bytes, that's four. Four bytes, that's eight. Should be four bytes. Should be four bytes for this, two bytes for this... Let me start over. Four bytes, four bytes for this, two bytes for these is now a total of eight, and four bytes for this is a total of twelve. And that's what the perfect length field is, twelve. Are we gonna break my mouth for you? Thank you. All right, so basically, this is what it looks like before we start thinking about what happens when you delete something or other. The entry for 1, 2, 3, 4 covers all the way to the end of the block. So the size is 485 instead of just the size of what would be 13 bytes. If I delete test.c, what happens is the record for ABC is now 12 bytes long. The record for ABC now becomes 12 plus 15 bytes of test.c. So it becomes total of 27. The record is still full, the chunk is still full. If I delete ABC instead of leaving test.c, if I test.c is still here, but I delete ABC, then I have to change it to 17, and I have to correct it to zero, because there is no previous entry. So basically the idea is nothing stands across the chunk, block boundary or sector boundary. Within a chunk, it's sort of a little more complicated than the last whole format, and you try to have a big space. For example, when you create a new file, you have to read through the directory to find out whether the file name already exists. As you read through the directory, you can say, why fast as you read through the directory, you can remember the empty, fixed-sized blocks that you've seen, and the name doesn't exist. 
You know where you're now putting the new name and where you're going to be creating it. Here, as you read through the main directory, you can see the name already exists. You can see whether there's an unused space where the size of the record line field is the size that it needs to be. You find the first space in the directory that largely contains the new entries that you might be creating. You can read all the way to the end of the directory and find out if that name already exists in the directory. But if it doesn't exist in the directory, and now he doesn't write the file once, and the directory once, you know where you put that space. You don't actually have to go explicitly hunting for space to put the new entry. You sort of get that as a byproduct of having to decide what the new name exists in the directory already or not. Okay, next topic is better locality. Basically, we want to arrange things so that without really too much extra work, we can keep things near other things. The data blocks of a single file near other data blocks of the same file. The data blocks of a file near the item of the same file. We want to hopefully have fewer seats. Now, before I start writing this, if you remember, we already talked about, well, you can't do disk editing anymore We don't know where anything is anymore. We don't know physically how far apart anything is. In the classical view of how operating systems talk about disks, you do the entire geometry of the disk. You know the exact number of services and the number of sectors for tracking. For tracking, you do exactly everything. So you can do the math, you can do exactly which cylinder everything was in. So you knew how far apart everything was. We're going to try to get better locality. Locality in terms of what? We don't know where anything is. How do you do locality when you know where nothing is? And the reality is that we're really the same way. When this was really designed, cylinders were still cylinders. 
We still knew where everything was. We hadn't actually introduced the so-called logical block addressing yet into the hardware disks unless in the operating system interface. So with this really designed, you knew exactly where everything was and locality really meant locality. When we talked about disk scheduling in today's world and how disk interface works, logical block numbers that are numerically near each other are still generally physically near each other. So if you put things in two blocks that are numerically block numbers near each other, then you do know you put them almost always when you put them physically near each other. You don't know some subtle things like automatic bad block forwarding, but both have very clear records. Hopefully in most cases today, you have no automatic bad block forwarding. You have no bad blocks. But you may have bad blocks. When this was first designed, the manufacturing of disks was not as perfect as it is today. You have this very uniform layer of magnetic oxide coating, so you can just deposit that and make it a totally perfect surface. The technology for that has gotten better. So even the manufacturing of disks used to have some bad blocks sometimes. Today, your hard disk probably has zero bad blocks on it, unless you're not lucky. So we're still going to be able to make use of a noticeable count that's relevant enough to still get things near each other and make the performance of using the files much better than it would be if we weren't trying to do this. So just like we talked about, you know, variable-length filings has also been introduced in the so-called FAST file system from the University of California, Berkeley. This is what the file system was named after. The FAST file system, this nation's file system, FAST variable-length filings, that can make it minuscule, the balance is lower, but on that matter, the file system is still quite a bit faster because of this change. 
So what we're going to do is divide the total amount of disk space into a sequence of cylinder groups. So originally, I think cylinder groups are really, literally a consecutive range of cylinders. Last thing, between cylinder number 5 and cylinder number 10, those are all near each other, and at most, they sure seem to be apart from each other. Cylinder number is what matters in terms of distance, and rotation, we have no control over. So putting things in, two things together in the same cylinder is, the bus you control in the bus that you pick in nearby cylinders is still pretty good. You don't want them in far apart cylinders. So we're going to divide the surface of the disk into these so-called cylinder groups. Now it's really in the range of imaginary cylinders, but it's still a range of, like, the block numbers that are numerically near each other and thus still tend to be near each other. And each cylinder group is essentially like a little miniature file system itself. Each cylinder group contains a redundant copy of the superblock. They also rearrange what they put in the superblock, so now you set up one superblock. With one superblock, you each can keep the one copy consistent with itself. There's only one copy. But if you have multiple copies, when I write this copy, do I have to now write only the other copies? No, because now the superblock is read-only. So the data structure that used to be... The superblock contains some information that's always been read-only, like the total size of the files and so forth. The superblock also used to contain, for example, the beginning of the free blocks, the beginning of the free items. That's now been removed outside the superblock. So the superblock does have to be able to read-only information. The beginning of the redundant copy can be in a very cylinder group. So now we get to the managers. The superblock, when you need to go read it, there's always a copy nearby, obviously called the column on a seek-away. 
And if you have a hard disk, if you have one clock like that, what happens if your superblock blocks with that? Now we have backup copies in case of failure, as well as just the thing nearby. So we've got a copy of the superblock, and we do so by moving the intervials into the beginning of our little intervials. The subset of the total disk space. You might have a range of consecutive cylinder numbers. That means I have a subset of the total storage capacity in each cylinder. This is the blocks that are in that set of cylinders. I've set up the items. Instead of having all the items together at the very beginning of the disk space, I have a little pool of items that can be in each cylinder group. I have a separate list of free blocks in the cylinder group. I have a separate list of free items in the cylinder group. All the information within the cylinder group constitutes everything I need to use that cylinder group as such as a little miniature file system. It's not exactly used that way, but we take advantage of the ability to do things within the cylinder group, because everything is still contained within the cylinder group. So now we have the disk of that up. We have to decide what to put where. Should I put everything in one cylinder group, and when that fills up, put it in the next cylinder group? When that fills up, go to the next one? Should I put things sort of randomly in different cylinder groups? That's great things. Should I just put them in randomly? But now, before the file system changed, before we had cylinder groups. Not literally so, but you can almost imagine, almost think of it as every block that gets added to any file gets put in a random block number sub-run surface of the disk. We had really no control over which block that we got. We should take the first block of our free list, and it might be anywhere. Now we have a free list in each cylinder group. So all I've got to decide is which cylinder group to put the next block into. 
I've got to decide which cylinder group to use the next line up from. The subs of the inodes are still pre-allocated, but the subs of the inodes are at the beginning of each cylinder group. So when I create a new file, which cylinder group should I use the inode out of to make that new file? When I enlarge the file, rather than block the file, should I put that block in the same cylinder group as other parts of the file? Should I put that block in a different cylinder group? Instead of policy, decide which cylinder group to use for what and when. The first

In [None]:
completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo", # this is "ChatGPT" $0.002 per 1k tokens
  messages=[{"role": "user", "content": prompt + transcript_sec_2}]
)

In [None]:
completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo", # this is "ChatGPT" $0.002 per 1k tokens
  messages=[{"role": "user", "content": prompt}, {"role": "assistant", "content": "Yes, I'm ready."}, 
            {"role": "user", "content": f"Here goes the transcript: {transcript_sec_2}"}]
)

In [None]:
reply_content = completion.choices[0].message.content
print(reply_content)
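
The reply may wrap the LaTeX in extra conversational text before or after the document. A minimal sketch for keeping only the LaTeX, assuming the model returns a full document that starts with `\documentclass` and ends with `\end{document}`; `extract_latex` is a hypothetical helper and falls back to the raw reply when those markers are missing.

```python
def extract_latex(reply: str) -> str:
    """Return the span from \\documentclass to \\end{document}, or the stripped reply."""
    end_marker = "\\end{document}"
    start = reply.find("\\documentclass")
    end = reply.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        # Markers not found: leave the reply untouched apart from whitespace
        return reply.strip()
    return reply[start:end + len(end_marker)]


example_reply = (
    "Sure, here are the notes:\n"
    "\\documentclass{article}\n\\begin{document}\nHello\n\\end{document}\n"
    "Let me know if you need changes."
)
print(extract_latex(example_reply))
```

The extracted string can be written to a `.tex` file and compiled outside the notebook.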

### reply_content: Phil lecture note

\documentclass{article}

\begin{document}

\section*{Coherent Set of Attitudes}

\subsection*{Introduction}
In this lecture, we will be discussing Hume's idea of what it means to be rational and coherent. Contradicting oneself is not the only irrational act, and simply having a coherent set of attitudes is not enough to show that one is being rational.

\subsection*{Example}
Consider the following example. Suppose Sam comes in and says that it's snowing. Initially, we have no attitude towards whether it's snowing or not. However, upon hearing Sam's statement, we become highly confident that Sam said that it's snowing. We also have a strong belief that it does not snow in Houston. To maintain coherence, we have two options. We could either give up on our confidence in Sam's veracity or give up our belief that it does not snow in Houston. 

\subsection*{Hume's Perspective}
 Hume's idea is that we shouldn't do either of these things. If we initially have a high level of confidence that it does not snow in Houston and then become highly confident that Sam said it's snowing, there is a mutual destruction between the confidence in the truth of two and the confidence in the falsehood of three. Instead of maintaining a high level of confidence in both, we should become less confident in two and more confident in three. 

\subsection*{Rationality and Beliefs}
Hume's perspective of belief is that it is not an all-or-nothing matter. We often talk about belief as a categorical attitude, but it can also involve a degree of commitment towards the truth of a proposition. Rationality is not solely based on the truth of a statement but rather the appropriate attitudes to adopt based on evidence. 

\subsection*{Conclusion}
In summary, having a coherent set of attitudes is not enough to demonstrate rationality. Rationality depends on the appropriate attitude to adopt based on evidence, even if it does not strictly align with logic. 

\end{document}

### reply_content: 421 lecture note

\documentclass{article}

\begin{document}

\section{Updating the Directory}

There is a need to update the directory in order to secure the entry by entry. The process involves updating the data and then forgetting to be block by block. Once a getting block is available, the directory can be entered to block, which assumes that they are entering by entry. The space allocation within the block requires jump by jump calculation which are bounded in both ways. This ensures that there is no writing of a jump that might stand across the block or a sacred boundary. By writing a block after the first half of the jump before writing the second half of the jump, there are consistent crashes that can be avoided. The jumps are written entirely on one operator and are expanded across the block and across the sacred boundary.

\section{Fields in the Directory}

There are several fields that are crucial in the directory. The name field has a length that can be calculated from the null termination operator. The record length field is independent of the name length field, indicating the length of the entire table. The record length field is necessary when the entire record is larger than the size of the name plus the size of the item number plus the size of the name length field. The chunk size is always full in the 512-byte chunks. Hence, if a name does not exactly go up to the 512-byte, the last record in the chunk will be longer than expected. 

\section{Deleting a Directory Entry}

When deleting a directory entry, the item number can be changed to zero, and the directory entry is considered to be there technically, but it is ignored when reading over it. The space is still occupied by the size of the directory, but it does not determine the actual directory entry. It is also possible to delete a directory entry and place it in the same chunk. However, this may increase the size of the small space. 

\section{Example}

Consider a chunk containing three directory entries. If one entry is deleted, the chunk's size remains the same, but the space used by the deleted entry is free for use. However, if the first entry is deleted, a new chunk is created with a size that is larger than the previous chunk since the space left by the first entry is merged with the previous directory entry. 

\section{Limitations}

The name length and record on fields are twice each, and the representation of the names is up to two to the 16th. The item number used to be two bytes. The limitation of the names is 255 characters, and this is considered long enough. 

\end{document}

# User Interface

In [None]:
message_history = []
# Example input: What is the moon's circumference in km?
user_input = input("> ")
print("User's input was: ", user_input)

In [None]:
message_history.append({"role": "user", "content": f"{user_input}"})
print(message_history)

In [None]:
completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=message_history
)

# Now we can print the response:
reply_content = completion.choices[0].message.content
print(reply_content)

In [None]:
# note the use of the "assistant" role here. This is because we're feeding the model's response into context.
message_history.append({"role": "assistant", "content": f"{reply_content}"})

In [None]:
# Example follow-up: Which moon is that in reference to?
user_input = input("> ")
print("User's input was: ", user_input)
print()
message_history.append({"role": "user", "content": f"{user_input}"})

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_history
)

reply_content = completion.choices[0].message.content
print(reply_content)

In [None]:
message_history = []

def chat(inp, role="user"):
    message_history.append({"role": role, "content": f"{inp}"})
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=message_history
    )
    reply_content = completion.choices[0].message.content
    message_history.append({"role": "assistant", "content": f"{reply_content}"})
    return reply_content

for i in range(2):
    user_input = input("> ")
    print("User's input was: ", user_input)
    print(chat(user_input))
    print()
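
Because `chat` appends every turn to `message_history`, the request grows with each exchange and can eventually exceed the model's context limit. A minimal sketch of one way to bound it, keeping only the most recent messages that fit a word budget. `trim_history` is a hypothetical helper, the word count is only a proxy for tokens, and the default budget of 3000 words is an assumption rather than a documented limit.

```python
def trim_history(history, max_words=3000):
    """Keep the most recent messages whose combined word count fits within max_words.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    total = 0
    for message in reversed(history):
        n = len(message["content"].split())
        if kept and total + n > max_words:
            break
        kept.append(message)
        total += n
    kept.reverse()  # restore chronological order
    return kept


demo_history = [
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
print(trim_history(demo_history, max_words=4))
```

Passing `trim_history(message_history)` as `messages` instead of the full list would cap the request size at the cost of forgetting older turns.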

# Q & A model

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install torch
!pip install datasets

In [None]:
from transformers import pipeline

# Instantiate the pipeline
qa_pipeline = pipeline('question-answering', model='bert-large-uncased-whole-word-masking-finetuned-squad', tokenizer='bert-large-uncased-whole-word-masking-finetuned-squad')

# Define the question and the context
question = "Which NFL team represented the AFC at Super Bowl 50?"
context = "Super Bowl 50 was an American football game to determine the champion of the national football. League. NFL. For the 2015 season. The American Football Conference. AFC. Champion Denver Broncos defeated the National Football Conference. NFC. Champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016. At Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl. The league emphasized the ''Golden Anniversary'' with various gold-themed initiatives. As well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals. Under which the game would have been known as ''Super Bowl L''. So that the logo could prominently feature the Arabic numerals 50"

# Use the pipeline to get the answer
answer = qa_pipeline(question=question, context=context)

# Print the answer
print(answer)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

def t5_question_answering(context, question, model_name='t5-small'):
    # Load the tokenizer and model
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Define the input prompt, question, and context
    prompt = "question: " + question + " context: " + context

    # Encode the input string using the tokenizer
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Generate the answer using the T5 model
    output_ids = model.generate(input_ids=input_ids)

    # Decode the output tokens into a string
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return answer

# Sample context and question
context = "In 2020, OpenAI released the third version of its language model named GPT-3. GPT-3 is capable of understanding and generating human-like text."
question = "What is GPT-3?"

# Get the answer
answer = t5_question_answering(context, question)
print(answer)

In [None]:
import torch
from transformers import XLNetTokenizer, XLNetForQuestionAnsweringSimple

def xlnet_question_answering(context, question, model_name='xlnet-base-cased'):
    # Load the tokenizer and model.
    # Note: 'xlnet-base-cased' is not fine-tuned for QA, so the span head is
    # newly initialised and the answers are not meaningful without fine-tuning.
    tokenizer = XLNetTokenizer.from_pretrained(model_name)
    model = XLNetForQuestionAnsweringSimple.from_pretrained(model_name)

    # Tokenize the input
    inputs = tokenizer(question, context, add_special_tokens=True, return_tensors='pt')
    input_ids = inputs['input_ids'].tolist()[0]

    # Get the model output (no gradients needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)

    # XLNetForQuestionAnsweringSimple returns per-token start/end logits.
    # XLNetForQuestionAnswering returns beam-search indices instead
    # (start_top_index / end_top_index), which are not scores to argmax over.
    answer_start = int(torch.argmax(outputs.start_logits))
    answer_end = int(torch.argmax(outputs.end_logits)) + 1

    # Convert the answer tokens to a string
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    return answer

In [None]:
# Sample context and question
context = "In 2020, OpenAI released the third version of its language model named GPT-3. GPT-3 is capable of understanding and generating human-like text."
question = "What is GPT-3?"

# Get the answer
answer = xlnet_question_answering(context, question)

In [None]:
print(answer)

In [None]:
from transformers import pipeline

def roberta_question_answering(context, question):
    model_name = "deepset/roberta-base-squad2"

    # Build the QA pipeline; it loads the model and tokenizer itself,
    # so no separate from_pretrained calls are needed here
    nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
    QA_input = {
        'question': question,
        'context': context
    }
    res = nlp(QA_input)

    return res['answer']

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
res['answer']

In [None]:
from datasets import load_dataset

# Load the SQuAD dataset
squad_data = load_dataset("squad")

In [None]:
# Access the train and validation splits
train_data = squad_data["train"]
validation_data = squad_data["validation"]

# Print the number of examples in each split
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(validation_data)}")

In [None]:
# A slice of a split can be loaded directly.
# Here: the first 10% of the `train` and `validation` splits.
small_train_data = load_dataset("squad", split='train[:10%]')
small_validation_data = load_dataset("squad", split='validation[:10%]')
# Print the number of examples in each slice
print(f"Number of training examples: {len(small_train_data)}")
print(f"Number of validation examples: {len(small_validation_data)}")

In [None]:
print(f"The first row of small_train_data: {small_train_data[0]}")
print(f"The first row context: {small_train_data[0]['context']}")
print(f"The first row question: {small_train_data[0]['question']}")
print(f"The first row answers: {small_train_data[0]['answers']}")

In [None]:
answer = roberta_question_answering(context = small_train_data[0]['context'], question = small_train_data[0]['question'])

#answer = xlnet_question_answering(context = small_train_data[0]['context'], question = small_train_data[0]['question'])

In [None]:
answer

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

def distilbert_question_answering(question, context):
    model_name = "distilbert-base-uncased-distilled-squad"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
    outputs = model(**inputs)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

    return answer
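
Taking the argmax of the start and end logits independently, as the functions above do, can place the end before the start and produce an empty answer. A minimal sketch of a joint search over valid spans follows, assuming plain Python lists of scores; `best_span` and `max_answer_len` are hypothetical names for illustration and not part of the transformers API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end],
    subject to start <= end < start + max_answer_len (end is inclusive)."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Made-up scores where the independent argmax would give start=2, end=0
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 1.0, 3.0]
print(best_span(start_logits, end_logits))
```

With real model outputs, the logits would first be converted to lists, for example with `outputs.start_logits[0].tolist()`.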

In [None]:
answer = distilbert_question_answering(context = small_train_data[0]['context'], question = small_train_data[0]['question'])
print("Answer:", answer)

In [None]:
! pip install transformers
! pip install sentencepiece

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# AutoModelWithLMHead is deprecated; T5 is an encoder-decoder model,
# so AutoModelForSeq2SeqLM is the matching auto class
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
question = "What is 42?"
context = "42 is the answer to life, the universe and everything"
# Named `prompt` to avoid shadowing the built-in input()
prompt = f"question: {question} context: {context}"
encoded_input = tokenizer([prompt],
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)
output_ids = model.generate(input_ids=encoded_input.input_ids,
                            attention_mask=encoded_input.attention_mask)
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(answer)

# DistilBERT and RoBERTa

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json
import os
import pandas as pd

In [None]:
! pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

--2023-04-27 16:09:41--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2023-04-27 16:09:41 (307 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]



In [None]:
# Use a context manager so the file handle is closed after loading
with open('/content/dev-v1.1.json') as f:
    dev_json = json.load(f)

Helper for inspecting the JSON structure before extracting a QA pair

In [None]:
import json

# Function to visualize the structure of a JSON object
def visualize_json_structure(json_data, indent=0):
    for key, value in json_data.items():
        if isinstance(value, dict):
            print('  ' * indent + f'{key}:')
            visualize_json_structure(value, indent + 1)
        elif isinstance(value, list) and len(value) > 0 and isinstance(value[0], dict):
            print('  ' * indent + f'{key} (list of {len(value)} objects):')
            visualize_json_structure(value[0], indent + 1)
        else:
            print('  ' * indent + f'{key}: {type(value).__name__}')


# Visualize the JSON structure
visualize_json_structure(dev_json)

data (list of 48 objects):
  title: str
  paragraphs (list of 54 objects):
    context: str
    qas (list of 30 objects):
      answers (list of 3 objects):
        answer_start: int
        text: str
      question: str
      id: str
version: str
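
The structure printed above (articles, then paragraphs, then `qas`) is what the later loops walk through by index. A minimal sketch of flattening it into one record per question is shown below; `iter_qa_pairs` is a hypothetical helper, and the `toy` dictionary is made-up data that only mimics the layout of `dev_json`.

```python
def iter_qa_pairs(squad_json):
    """Yield one flat record per question in a SQuAD v1.1-style dict."""
    for article_idx, article in enumerate(squad_json["data"]):
        for paragraph_idx, paragraph in enumerate(article["paragraphs"]):
            for qa in paragraph["qas"]:
                yield {
                    "article_idx": article_idx,
                    "paragraph_idx": paragraph_idx,
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "answers": [a["text"] for a in qa["answers"]],
                }

# Made-up sample with the same nesting as dev-v1.1.json
toy = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Denver won the game in 2016.",
            "qas": [
                {"id": "q1", "question": "Who won?",
                 "answers": [{"answer_start": 0, "text": "Denver"}]},
                {"id": "q2", "question": "When?",
                 "answers": [{"answer_start": 23, "text": "2016"}]},
            ],
        }],
    }],
}

rows = list(iter_qa_pairs(toy))
print(len(rows), rows[0]["id"], rows[0]["answers"])
```

On the real file, `list(iter_qa_pairs(dev_json))` would give one entry per question with its article and paragraph index attached.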


In [None]:
len(dev_json["data"][0]["paragraphs"])

In [None]:
dev_json["data"][0]["paragraphs"][0]["qas"][0]

In [None]:
dev_json["data"][0]["paragraphs"][0]["qas"][0]['answers']

In [None]:
dev_json["data"][0]["paragraphs"][0]["qas"][0]['answers'][0]['text']

In [None]:
dev_json["data"][0]["paragraphs"][0]["qas"][0]['question']

In [None]:
from transformers import pipeline

distillbert_model = "distilbert-base-uncased-distilled-squad"

# Create a question-answering pipeline
distillbert_pipeline = pipeline("question-answering", model=distillbert_model, tokenizer=distillbert_model)

# Define the context and question
context = "42 is the answer to life, the universe, and everything."
question = "What is 42?"

# Get the answer using the pipeline
result = distillbert_pipeline(question=question, context=context)

# Print the answer
print("Answer:", result["answer"])

Answer: the answer to life, the universe, and everything


In [None]:
from transformers import pipeline

roberta_model = "deepset/roberta-base-squad2"

# Create a question-answering pipeline
qa_roberta = pipeline("question-answering", model=roberta_model, tokenizer=roberta_model)

# Define the context and question
context = "42 is the answer to life, the universe, and everything."
question = "What is 42?"

# Get the answer using the pipeline
result = qa_roberta(question=question, context=context)

# Print the answer
print("Answer:", result["answer"])

Answer: the answer to life, the universe, and everything


In [None]:
def get_QA(qa_pipeline, context, question):
    # Pass question and context by keyword so the argument order cannot be mixed up
    return qa_pipeline(question=question, context=context)["answer"]


In [None]:
context1 = "42 is the answer to life, the universe, and everything."
question1 = "What is 42?"
get_QA(qa_roberta, context1, question1)

'the answer to life, the universe, and everything'

In [None]:
dev_json["data"][0]["paragraphs"][0]["qas"][0]['answers'][0]['text']

In [None]:
# from tqdm import tqdm

# pred_list = []

# def test_run_time(qa_pipeline_name, article_idx_list):
#     count = 0
#     for article_idx in tqdm(article_idx_list):
#         article = dev_json["data"][article_idx]
#         for paragraph_idx in range(len(article["paragraphs"])):
#             paragraph = article["paragraphs"][paragraph_idx]
#             paragraph_context = paragraph["context"]
#             for qas_idx in range(len(paragraph["qas"])):
#                 qas = paragraph["qas"][qas_idx]
#                 question = qas['question']
#                 ans = get_QA(qa_pipeline_name, paragraph_context, question)
#                 # print(ans)
#                 count += 1
#                 pred_list.append()

#     print(f"Total number of QA pair answered {count}")
#     return (0)

In [None]:
from tqdm import tqdm

def build_ref_list(article_idx_list):
    ref_list = []
    count = 0
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        for paragraph_idx in range(len(article["paragraphs"])):
            paragraph = article["paragraphs"][paragraph_idx]
            paragraph_context = paragraph["context"]
            for qas_idx in range(len(paragraph["qas"])):
                qas = paragraph["qas"][qas_idx]
                id = qas["id"]
                # print(ans)
                count += 1
                ref_list.append({'answers': qas["answers"], 'id': id})

    print(f"Total number of QA pair answered {count}")
    return ref_list

In [None]:
ref_list_first5 = build_ref_list(range(5))

100%|██████████| 5/5 [00:00<00:00, 402.58it/s]

Total number of QA pair answered 1877





In [None]:
ref_list_first48 = build_ref_list(range(48))

100%|██████████| 48/48 [00:00<00:00, 3501.03it/s]

Total number of QA pair answered 10570





In [None]:
from tqdm import tqdm

def test_pipeline(qa_pipeline_name, article_idx_list):
    pred_list = []
    count = 0
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        for paragraph_idx in range(len(article["paragraphs"])):
            paragraph = article["paragraphs"][paragraph_idx]
            paragraph_context = paragraph["context"]
            for qas_idx in range(len(paragraph["qas"])):
                qas = paragraph["qas"][qas_idx]
                question = qas['question']
                ans = get_QA(qa_pipeline_name, paragraph_context, question)
                id = qas["id"]
                # print(ans)
                count += 1
                pred_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})

    print(f"Total number of QA pair answered {count}")
    return pred_list

In [None]:
from tqdm import tqdm

def test_pipeline(qa_pipeline_name, article_idx_list):
    pred_list = []
    sep_article_list = []
    count = 0
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        article_list = []
        for paragraph_idx in range(len(article["paragraphs"])):
            paragraph = article["paragraphs"][paragraph_idx]
            paragraph_context = paragraph["context"]
            for qas_idx in range(len(paragraph["qas"])):
                qas = paragraph["qas"][qas_idx]
                question = qas['question']
                ans = get_QA(qa_pipeline_name, paragraph_context, question)
                id = qas["id"]
                # print(ans)
                count += 1
                pred_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})
                article_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})

        # Append once per article, not once per paragraph
        sep_article_list.append(article_list)

    print(f"Total number of QA pair answered {count}")
    return pred_list, sep_article_list

In [None]:
# test_pipeline returns (all predictions, predictions grouped per article)
pred_list_48_roberta, sep_article_list_48_roberta = test_pipeline(qa_roberta, range(48))

In [None]:
pred_list_just4_roberta, sep_article_list_just4_roberta = test_pipeline(qa_roberta, [4])

In [None]:
! pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

In [None]:
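# RoBERTa on all 48 articles
# Note: load_metric is deprecated in recent versions of datasets in favour of
# the separate `evaluate` library; it works with the version installed above.
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = pred_list_48_roberta
references = ref_list_first48
results = squad_metric.compute(predictions=predictions, references=references)
# Prints a dict with 'exact_match' and 'f1', both on a 0-100 scale
print(results)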

NameError: ignored

In [None]:
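# test_pipeline returns (all predictions, predictions grouped per article)
pred_list_48_distillbert, sep_article_list_48_distillbert = test_pipeline(distillbert_pipeline, range(48))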

In [None]:
# DistilBERT on all 48 articles
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = pred_list_48_distillbert
references = ref_list_first48
results = squad_metric.compute(predictions=predictions, references=references)
# Prints a dict with 'exact_match' and 'f1', both on a 0-100 scale
print(results)

# Shengqi's section

In [None]:
# Shared helper: score a list of predictions against a list of references
from datasets import load_metric

def squad_eval(pred_list, ref_list):
    # pred_list: [{'prediction_text': str, 'id': str}, ...]
    # ref_list:  [{'answers': [...], 'id': str}, ...]
    squad_metric = load_metric("squad")
    results = squad_metric.compute(predictions=pred_list, references=ref_list)
    # results is a dict with 'exact_match' and 'f1', both on a 0-100 scale
    print(results)
    return results

In [None]:
# You can also download the zip files manually and add them into certain dir
! unzip /content/drive/MyDrive/SpokenSQuAD/dev_1_2khz_1_transcripts_wer.zip
! unzip /content/drive/MyDrive/SpokenSQuAD/dev_1_44.1khz_2_transcripts_wer.zip
! unzip /content/drive/MyDrive/SpokenSQuAD/dev_1_4khz_1_transcripts_wer.zip
! unzip /content/drive/MyDrive/SpokenSQuAD/dev_44.1khz_classroom_snr_2_transcripts_wer.zip
! unzip /content/drive/MyDrive/SpokenSQuAD/dev_44.1khz_classroom_snr_5_transcripts_wer.zip

In [None]:
pd.read_csv("/content/content/drive/MyDrive/SpokenSQuAD/dev_1_2khz_1_transcripts_wer/article_0_transcripts_wer.csv")

In [None]:
import pandas as pd
import os

folder_path = '/content/content/drive/MyDrive/SpokenSQuAD/dev_44.1khz_classroom_snr_5_transcripts_wer'

# List all CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]

# Read every CSV and concatenate them into a single DataFrame
# (expected columns: article_idx, paragraph_idx, transcription, wer)
frames = [pd.read_csv(os.path.join(folder_path, csv_file)) for csv_file in csv_files]
combined_df = pd.concat(frames, ignore_index=True)

# Print the combined DataFrame
print(combined_df)

In [None]:
import numpy as np

np.array(combined_df["wer"].tolist()).mean()


0.07614826501828696

In [None]:
combined_df.to_csv('/content/drive/MyDrive/SpokenSQuAD/dev_44.1khz_classroom_snr_5_transcripts_wer.csv', index=False)
combined_df.to_csv('/content/dev_44.1khz_classroom_snr_5_transcripts_wer.csv', index=False)

In [None]:
# Look up the transcription of one paragraph in a combined transcripts_wer DataFrame
def get_transcript(combined_df, article_idx, paragraph_idx):
    result = combined_df.query(f'article_idx == {article_idx} and paragraph_idx == {paragraph_idx}')

    if result.empty:
        # Fail loudly instead of returning None, which would only surface
        # later as a confusing error inside the QA pipeline
        raise ValueError(f'No transcription found for article_idx={article_idx}, paragraph_idx={paragraph_idx}.')

    return result['transcription'].iloc[0]
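
`get_transcript` filters the whole DataFrame once per paragraph. When many lookups are needed, a dictionary keyed by `(article_idx, paragraph_idx)` is a common alternative. The sketch below uses only the standard library and assumes the CSV has the columns listed earlier (`article_idx`, `paragraph_idx`, `transcription`, `wer`); `build_transcript_index` is a hypothetical helper and the inline CSV is made-up sample data.

```python
import csv
import io

def build_transcript_index(csv_file):
    """Map (article_idx, paragraph_idx) -> transcription from an open CSV file."""
    index = {}
    for row in csv.DictReader(csv_file):
        key = (int(row["article_idx"]), int(row["paragraph_idx"]))
        index[key] = row["transcription"]
    return index

# Made-up rows standing in for a transcripts_wer CSV
sample_csv = io.StringIO(
    "article_idx,paragraph_idx,transcription,wer\n"
    "0,0,first paragraph,0.05\n"
    "0,1,second paragraph,0.10\n"
)
index = build_transcript_index(sample_csv)
print(index[(0, 1)])
```

For a file on disk, the same function can be called on `open(path, newline="")` inside a `with` block.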


In [None]:
combined_df_2khz = pd.read_csv("/content/dev_1_2khz_1_transcripts_wer.csv")
get_transcript(combined_df_2khz, 0, 0)

" Super Bowl 50 was an American football game to determine the champion of the national football. Leeds, NFL, for the 2015 season. The American Football Conference, AFC, champion Denver Broncos defeated the National Football Conference. NFC, champion Carolina, pampers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California, as this was the 50th Super Bowl. The league emphasized the Golden Anniversary with Bay with Gold-Beamed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals, under which the game would have been known as Super Bowl R, so that the logo could prominently feature the Arabic numeral 50."

In [None]:
combined_df_44khz = pd.read_csv("/content/dev_1_44.1khz_2_transcripts_wer.csv")
get_transcript(combined_df_44khz, 0, 0)

" Super Bowl 50 was an American football game to determine the champion of the national football. League. NFL. For the 2015 season. The American Football Conference. AFC. Champion Denver Broncos defeated the National Football Conference. NFC. Champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016. At Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl. The league emphasized the ''Golden Anniversary'' with various gold-themed initiatives. As well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals. Under which the game would have been known as ''Super Bowl L''. So that the logo could prominently feature the Arabic numerals 50"

In [None]:
from tqdm import tqdm

def test_pipeline(qa_pipeline_name, article_idx_list):
    pred_list = []
    sep_article_list = []
    count = 0
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        article_list = []
        for paragraph_idx in range(len(article["paragraphs"])):
            paragraph = article["paragraphs"][paragraph_idx]
            paragraph_context = paragraph["context"]
            for qas_idx in range(len(paragraph["qas"])):
                qas = paragraph["qas"][qas_idx]
                question = qas['question']
                ans = get_QA(qa_pipeline_name, paragraph_context, question)
                id = qas["id"]
                # print(ans)
                count += 1
                pred_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})
                article_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})
            
        sep_article_list.append(article_list)

    print(f"Total number of QA pair answered {count}")
    return pred_list, sep_article_list

In [None]:
from tqdm import tqdm

def test_pipeline_tran(combined_df, qa_pipeline_name, article_idx_list):
    pred_list = []
    sep_article_list = []
    count = 0
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        article_list = []
        for paragraph_idx in range(len(article["paragraphs"])):
            paragraph = article["paragraphs"][paragraph_idx]
            paragraph_context = get_transcript(combined_df, article_idx, paragraph_idx)
            for qas_idx in range(len(paragraph["qas"])):
                qas = paragraph["qas"][qas_idx]
                question = qas['question']
                ans = get_QA(qa_pipeline_name, paragraph_context, question)
                id = qas["id"]
                # print(ans)
                count += 1
                pred_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})
                article_list.append({'prediction_text': f'{ans}', 'id': f'{id}'})
            
        sep_article_list.append(article_list)

    print(f"Total number of QA pair answered {count}")
    return pred_list, sep_article_list
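
`test_pipeline` and `test_pipeline_tran` differ only in where the paragraph context comes from (the original SQuAD text or a transcript). One possible way to fold them together is to pass the context source as an optional callback. This is a sketch under assumptions, not the notebook's own code: `run_qa`, `answer_fn` and `context_fn` are hypothetical names, and a stub function stands in for the Hugging Face pipeline so the example runs on its own.

```python
def run_qa(dataset, answer_fn, article_idx_list, context_fn=None):
    """Collect predictions overall and grouped per article.

    answer_fn(question=..., context=...) returns the answer text.
    context_fn(article_idx, paragraph_idx), if given, replaces the stored context.
    """
    pred_list = []
    per_article = []
    for article_idx in article_idx_list:
        article = dataset["data"][article_idx]
        article_preds = []
        for paragraph_idx, paragraph in enumerate(article["paragraphs"]):
            if context_fn is None:
                context = paragraph["context"]
            else:
                context = context_fn(article_idx, paragraph_idx)
            for qa in paragraph["qas"]:
                answer = answer_fn(question=qa["question"], context=context)
                record = {"prediction_text": str(answer), "id": str(qa["id"])}
                pred_list.append(record)
                article_preds.append(dict(record))
        per_article.append(article_preds)
    return pred_list, per_article

# Made-up dataset and a stub "model" that answers with the first word of the context
toy_dataset = {"data": [{"paragraphs": [
    {"context": "Denver won the game.", "qas": [{"id": "q1", "question": "Who won?"}]},
    {"context": "It was played in 2016.", "qas": [{"id": "q2", "question": "When?"}]},
]}]}

def first_word(question, context):
    return context.split()[0]

preds, per_article = run_qa(toy_dataset, first_word, [0])
noisy_preds, _ = run_qa(toy_dataset, first_word, [0],
                        context_fn=lambda a, p: f"transcript {a} {p}")
print(preds)
print(noisy_preds)
```

In the notebook, `answer_fn` could wrap `get_QA` with a chosen pipeline and `context_fn` could wrap `get_transcript` with a chosen DataFrame.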

In [None]:
pred_list_clean_distill, pred_sep_article_list_clean_distill = test_pipeline(distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:48<00:00, 33.73s/it]

Total number of QA pair answered 1335





In [None]:
ref_list_1to5 = build_ref_list(range(1, 6))

100%|██████████| 5/5 [00:00<00:00, 3058.86it/s]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_clean_distill, ref_list_1to5)

{'exact_match': 75.65543071161049, 'f1': 84.90251207594609}
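
For readers unfamiliar with the two numbers above: exact match checks whether the normalised prediction equals a normalised reference answer, and F1 measures token overlap between them, taking the best score over all reference answers. The sketch below is a simplified, self-contained version of SQuAD-style scoring for a single question; the function names are illustrative, and the `squad` metric itself remains the reference implementation.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts tokens shared by both answers
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_truths(metric_fn, prediction, ground_truths):
    """Score against every reference answer and keep the best."""
    return max(metric_fn(prediction, truth) for truth in ground_truths)

truths = ["Denver Broncos", "The Denver Broncos"]
print(best_over_truths(exact_match, "the Denver Broncos.", truths))
print(best_over_truths(f1_score, "Broncos", truths))
```

The reported `exact_match` and `f1` values are these per-question scores averaged over all questions and scaled to 0-100.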


In [None]:
ref_list_sep_article = []
for idx in range(1, 6):
    ref_list_one = build_ref_list([idx])
    ref_list_sep_article.append(ref_list_one)


100%|██████████| 1/1 [00:00<00:00, 2222.74it/s]


Total number of QA pair answered 247


100%|██████████| 1/1 [00:00<00:00, 2957.90it/s]


Total number of QA pair answered 112


100%|██████████| 1/1 [00:00<00:00, 1135.74it/s]


Total number of QA pair answered 511


100%|██████████| 1/1 [00:00<00:00, 3844.46it/s]


Total number of QA pair answered 197


100%|██████████| 1/1 [00:00<00:00, 2908.67it/s]

Total number of QA pair answered 268





In [None]:
df_4khz = pd.read_csv("/content/dev_1_4khz_1_transcripts_wer.csv")
pred_list_1_distillbert_4khz, pred_sep_article_list_1_distillbert_4khz = test_pipeline_tran(df_4khz, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:59<00:00, 35.93s/it]

Total number of QA pair answered 1335





In [None]:
def eval_both(pred_list, pred_sep_article_list):
    # Print the overall score first, then one score per article.
    # Relies on the globals ref_list_1to5 and ref_list_sep_article.
    squad_eval(pred_list, ref_list_1to5)
    for i, article_preds in enumerate(pred_sep_article_list):
        print(f"Article {i + 1} result")
        squad_eval(article_preds, ref_list_sep_article[i])


In [None]:
eval_both(pred_list_1_distillbert_4khz, pred_sep_article_list_1_distillbert_4khz)

{'exact_match': 47.41573033707865, 'f1': 62.038463174007504}
Article 1 result
{'exact_match': 50.202429149797574, 'f1': 63.315411084641866}
Article 2 result
{'exact_match': 42.857142857142854, 'f1': 58.59835600907028}
Article 3 result
{'exact_match': 42.465753424657535, 'f1': 59.55666245043249}
Article 4 result
{'exact_match': 45.17766497461929, 'f1': 62.1043498156069}
Article 5 result
{'exact_match': 57.83582089552239, 'f1': 66.9828899273588}


In [None]:
df_44khz = pd.read_csv("/content/dev_1_44.1khz_2_transcripts_wer.csv")
pred_list_1_distillbert_44khz, pred_sep_article_list_1_distillbert_44khz = test_pipeline_tran(df_44khz, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:49<00:00, 33.98s/it]

Total number of QA pair answered 1335





In [None]:
eval_both(pred_list_1_distillbert_44khz, pred_sep_article_list_1_distillbert_44khz)

{'exact_match': 63.89513108614232, 'f1': 76.33880192152579}
Article 1 result
{'exact_match': 62.75303643724696, 'f1': 75.97445963842723}
Article 2 result
{'exact_match': 58.92857142857143, 'f1': 69.8483560090703}
Article 3 result
{'exact_match': 63.79647749510763, 'f1': 77.93858388222603}
Article 4 result
{'exact_match': 59.390862944162436, 'f1': 72.17506830450994}
Article 5 result
{'exact_match': 70.5223880597015, 'f1': 79.3973445586703}


In [None]:
pred_list_clean_distill, pred_sep_article_list_clean_distill = test_pipeline(distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:56<00:00, 35.35s/it]

Total number of QA pair answered 1335





In [None]:
eval_both(pred_list_clean_distill, pred_sep_article_list_clean_distill)

{'exact_match': 75.65543071161049, 'f1': 84.90251207594609}
Article 1 result
{'exact_match': 80.16194331983806, 'f1': 89.18600248154901}
Article 2 result
{'exact_match': 78.57142857142857, 'f1': 85.51729024943312}
Article 3 result
{'exact_match': 75.92954990215264, 'f1': 85.87258149578217}
Article 4 result
{'exact_match': 66.49746192893402, 'f1': 77.47099347353155}
Article 5 result
{'exact_match': 76.49253731343283, 'f1': 84.31081955924847}


In [None]:
df_4khz = pd.read_csv("/content/dev_1_4khz_1_transcripts_wer.csv")
pred_list_1_roberta_4khz, pred_sep_article_list_1_roberta_4khz = test_pipeline_tran(df_4khz, qa_roberta, range(1, 6))

100%|██████████| 5/5 [05:41<00:00, 68.26s/it]

Total number of QA pair answered 1335





In [None]:
eval_both(pred_list_1_roberta_4khz, pred_sep_article_list_1_roberta_4khz)

{'exact_match': 53.03370786516854, 'f1': 67.2481908693095}
Article 1 result
{'exact_match': 54.65587044534413, 'f1': 68.25619348334033}
Article 2 result
{'exact_match': 52.67857142857143, 'f1': 64.7236394557823}
Article 3 result
{'exact_match': 48.336594911937375, 'f1': 65.00950570220745}
Article 4 result
{'exact_match': 53.807106598984774, 'f1': 69.58675179039467}
Article 5 result
{'exact_match': 60.07462686567164, 'f1': 69.92373091253693}


In [None]:
df_44khz = pd.read_csv("/content/dev_1_44.1khz_2_transcripts_wer.csv")
pred_list_1_roberta_44khz, pred_sep_article_list_1_roberta_44khz = test_pipeline_tran(df_44khz, qa_roberta, range(1, 6))

100%|██████████| 5/5 [05:32<00:00, 66.53s/it]

Total number of QA pair answered 1335





In [None]:
eval_both(pred_list_1_roberta_44khz, pred_sep_article_list_1_roberta_44khz)

{'exact_match': 72.88389513108615, 'f1': 83.6576218294353}
Article 1 result
{'exact_match': 68.42105263157895, 'f1': 80.77974003980195}
Article 2 result
{'exact_match': 64.28571428571429, 'f1': 74.68679138321995}
Article 3 result
{'exact_match': 73.5812133072407, 'f1': 85.34583277688417}
Article 4 result
{'exact_match': 71.57360406091371, 'f1': 82.15314392912578}
Article 5 result
{'exact_match': 80.22388059701493, 'f1': 87.94596572581649}


In [None]:
pred_list_clean_roberta, pred_sep_article_list_clean_roberta = test_pipeline(qa_roberta, range(1, 6))

100%|██████████| 5/5 [05:48<00:00, 69.78s/it]

Total number of QA pair answered 1335





In [None]:
eval_both(pred_list_clean_roberta, pred_sep_article_list_clean_roberta)

{'exact_match': 85.31835205992509, 'f1': 91.86646198813034}
Article 1 result
{'exact_match': 89.47368421052632, 'f1': 94.38069532806375}
Article 2 result
{'exact_match': 83.92857142857143, 'f1': 88.54166666666667}
Article 3 result
{'exact_match': 83.7573385518591, 'f1': 91.6784543175127}
Article 4 result
{'exact_match': 83.248730964467, 'f1': 90.75669210120185}
Article 5 result
{'exact_match': 86.56716417910448, 'f1': 92.11294716891732}
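
To compare conditions at a glance, the overall scores printed above (articles 1 to 5) can be gathered into one structure and the F1 drop relative to the clean text computed. The numbers below are copied from the outputs above; the condition labels (`clean`, `44.1khz`, `4khz`) and the helper `f1_drop` are names chosen for this sketch.

```python
# Overall scores for articles 1-5, copied from the evaluation outputs above
overall = {
    ("distilbert", "clean"):   {"exact_match": 75.65543071161049, "f1": 84.90251207594609},
    ("distilbert", "44.1khz"): {"exact_match": 63.89513108614232, "f1": 76.33880192152579},
    ("distilbert", "4khz"):    {"exact_match": 47.41573033707865, "f1": 62.038463174007504},
    ("roberta", "clean"):      {"exact_match": 85.31835205992509, "f1": 91.86646198813034},
    ("roberta", "44.1khz"):    {"exact_match": 72.88389513108615, "f1": 83.6576218294353},
    ("roberta", "4khz"):       {"exact_match": 53.03370786516854, "f1": 67.2481908693095},
}

def f1_drop(model, condition):
    """F1 points lost when moving from clean text to the given transcript condition."""
    return overall[(model, "clean")]["f1"] - overall[(model, condition)]["f1"]

for model in ("distilbert", "roberta"):
    for condition in ("44.1khz", "4khz"):
        print(f"{model:10s} {condition:8s} F1 drop: {f1_drop(model, condition):.2f}")
```

Keeping the scores in one dictionary also makes it easy to add further conditions, such as the classroom-noise transcripts, once they have been evaluated.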


Separation mark ======================================

In [None]:
res = squad_eval(pred_list_clean_distill, ref_list_1to5)

In [None]:
ref_list_sep_article = []
for idx in range(1, 6):
    ref_list_one = build_ref_list([idx])
    ref_list_sep_article.append(ref_list_one)

In [None]:
for i in range(len(pred_sep_article_list_clean_distill)):
    print(f"Article {i + 1} result")
    res = squad_eval(pred_sep_article_list_clean_distill[i], ref_list_sep_article[i])


Article 1 result
{'exact_match': 80.16194331983806, 'f1': 89.18600248154901}
Article 2 result
{'exact_match': 78.57142857142857, 'f1': 85.51729024943312}
Article 3 result
{'exact_match': 75.92954990215264, 'f1': 85.87258149578217}
Article 4 result
{'exact_match': 66.49746192893402, 'f1': 77.47099347353155}
Article 5 result
{'exact_match': 76.49253731343283, 'f1': 84.31081955924847}


In [None]:
len(pred_sep_article_list_clean_distill)

292

In [None]:
len(ref_list_sep_article)

5
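The two lengths above differ (292 versus 5), which suggests the prediction and reference containers no longer have the same shape, possibly because cells were run out of order. Before calling the metric, it can help to confirm that both sides cover the same question ids. The sketch below is a minimal check; `check_alignment` is a hypothetical helper that is not part of the original notebook, and the toy records only mimic the dict layout passed to the `squad` metric.

```python
def check_alignment(predictions, references):
    """Return (missing, extra) question ids between predictions and references.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    pred_ids = {p["id"] for p in predictions}
    ref_ids = {r["id"] for r in references}
    missing = ref_ids - pred_ids  # referenced questions that have no prediction
    extra = pred_ids - ref_ids    # predictions for questions outside the references
    return missing, extra


# Toy data in the same layout the notebook uses for the squad metric.
preds = [
    {"prediction_text": "Denver Broncos", "id": "q1"},
    {"prediction_text": "gold", "id": "q3"},
]
refs = [
    {"answers": {"answer_start": [177], "text": ["Denver Broncos"]}, "id": "q1"},
    {"answers": {"answer_start": [0], "text": ["Carolina Panthers"]}, "id": "q2"},
]

missing, extra = check_alignment(preds, refs)
print(sorted(missing), sorted(extra))
```

If both returned sets are empty, the prediction list and the reference list describe the same questions and can be scored together.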

In [None]:
from tqdm import tqdm

def test_pipeline_tran(combined_df, qa_pipeline_name, article_idx_list):
    """Answer every question of the selected articles, using the transcript
    of each paragraph (looked up in combined_df) as the QA context."""
    pred_list = []
    for article_idx in tqdm(article_idx_list):
        article = dev_json["data"][article_idx]
        for paragraph_idx, paragraph in enumerate(article["paragraphs"]):
            # The transcript replaces the original paragraph text as context
            paragraph_context = get_transcript(combined_df, article_idx, paragraph_idx)
            for qas in paragraph["qas"]:
                ans = get_QA(qa_pipeline_name, paragraph_context, qas["question"])
                # Keep the SQuAD question id so predictions can be matched to references
                pred_list.append({'prediction_text': f'{ans}', 'id': str(qas["id"])})

    print(f"Total number of QA pair answered {len(pred_list)}")
    return pred_list

In [None]:
df_2khz = pd.read_csv("/content/dev_1_2khz_1_transcripts_wer.csv")
pred_list_1_distillbert_2khz = test_pipeline_tran(df_2khz, distillbert_pipeline, range(1))

100%|██████████| 1/1 [01:27<00:00, 87.65s/it]

Total number of QA pair answered 810





In [None]:
ref_list_first1 = build_ref_list(range(1))

100%|██████████| 1/1 [00:00<00:00, 759.15it/s]

Total number of QA pair answered 810





In [None]:
# DistilBERT, first article only (2 kHz transcripts)
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = pred_list_1_distillbert_2khz
references = ref_list_first1
results = squad_metric.compute(predictions=predictions, references=references)
print(results)

{'exact_match': 54.44444444444444, 'f1': 66.36494367023877}


In [None]:
res = squad_eval(pred_list_1_distillbert_2khz, ref_list_first1)

{'exact_match': 54.44444444444444, 'f1': 66.36494367023877}


In [None]:
df_4khz = pd.read_csv("/content/dev_1_4khz_1_transcripts_wer.csv")
pred_list_1_distillbert_4khz = test_pipeline_tran(df_4khz, distillbert_pipeline, range(1))

100%|██████████| 1/1 [01:31<00:00, 91.62s/it]

Total number of QA pair answered 810





In [None]:
res2 = squad_eval(pred_list_1_distillbert_4khz, ref_list_first1)

{'exact_match': 48.888888888888886, 'f1': 61.633997125282576}


In [None]:
df_4khz = pd.read_csv("/content/dev_1_4khz_1_transcripts_wer.csv")
pred_list_1_roberta_4khz = test_pipeline_tran(df_4khz, qa_roberta, range(5))

100%|██████████| 5/5 [06:43<00:00, 80.71s/it]

Total number of QA pair answered 1877





In [None]:
ref_list_first5 = build_ref_list(range(5))

100%|██████████| 5/5 [00:00<00:00, 1985.94it/s]

Total number of QA pair answered 1877





In [None]:
res = squad_eval(pred_list_1_roberta_4khz, ref_list_first5)

{'exact_match': 53.22322855620671, 'f1': 67.44157347361114}


In [None]:
df_2khz = pd.read_csv("/content/dev_1_2khz_1_transcripts_wer.csv")
pred_list_1_roberta_2khz = test_pipeline_tran(df_2khz, qa_roberta, range(5))

100%|██████████| 5/5 [06:07<00:00, 73.54s/it]

Total number of QA pair answered 1877





In [None]:
res = squad_eval(pred_list_1_roberta_2khz, ref_list_first5)

{'exact_match': 57.00586041555674, 'f1': 70.34709404529723}


In [None]:
df_44khz = pd.read_csv("/content/dev_1_44.1khz_2_transcripts_wer.csv")
pred_list_1_roberta_44khz = test_pipeline_tran(df_44khz, qa_roberta, range(5))

100%|██████████| 5/5 [06:13<00:00, 74.63s/it]

Total number of QA pair answered 1877





In [None]:
res = squad_eval(pred_list_1_roberta_44khz, ref_list_first5)

{'exact_match': 74.85348961108151, 'f1': 84.24917991552333}


In [None]:
import numpy as np

np.array(df_4khz["wer"].tolist()).mean()

0.14872751481321927
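The value above is the mean of the `wer` column, which is assumed here to hold one word error rate per transcribed paragraph. As a reminder of what that number measures, the following is a minimal sketch of WER as word-level edit distance divided by the reference length. It is an illustration only, not the code that produced the CSV, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(example_wer, 4))
```

Libraries such as jiwer, imported earlier in the notebook, typically apply text normalization before counting errors, so their values can differ from this bare version.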

# REAL Evaluation starts here!

Here, we use only articles 1-5 of the dev set (indices `range(1, 6)`) for evaluation.

In [None]:
ref_list_1to5 = build_ref_list(range(1,6))

100%|██████████| 5/5 [00:00<00:00, 2187.72it/s]

Total number of QA pair answered 1335





2 kHz track

In [None]:
df_2khz = pd.read_csv("/content/dev_1_2khz_1_transcripts_wer.csv")

In [None]:
pred_list_roberta_2khz = test_pipeline_tran(df_2khz, qa_roberta, range(1, 6))

100%|██████████| 5/5 [05:17<00:00, 63.60s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_roberta_2khz, ref_list_1to5)

{'exact_match': 57.453183520599254, 'f1': 71.07357533205786}


In [None]:
pred_list_distillbert_2khz = test_pipeline_tran(df_2khz, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:23<00:00, 28.69s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_distillbert_2khz, ref_list_1to5)

{'exact_match': 50.71161048689139, 'f1': 65.73171074582916}


4 kHz track

In [None]:
df_4khz = pd.read_csv("/content/dev_1_4khz_1_transcripts_wer.csv")

In [None]:
pred_list_roberta_4khz = test_pipeline_tran(df_4khz, qa_roberta, range(1, 6))

100%|██████████| 5/5 [04:25<00:00, 53.10s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_roberta_4khz, ref_list_1to5)

{'exact_match': 53.03370786516854, 'f1': 67.2481908693095}


In [None]:
pred_list_distillbert_4khz = test_pipeline_tran(df_4khz, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:12<00:00, 26.59s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_distillbert_4khz, ref_list_1to5)

{'exact_match': 47.41573033707865, 'f1': 62.038463174007504}


In [None]:
ref_list_just4 = build_ref_list([4])

100%|██████████| 1/1 [00:00<00:00, 2305.83it/s]

Total number of QA pair answered 197





In [None]:
pred_list_roberta_4khz_just4 = test_pipeline_tran(df_4khz, qa_roberta, [4])
res = squad_eval(pred_list_roberta_4khz_just4, ref_list_just4)

100%|██████████| 1/1 [00:49<00:00, 49.54s/it]


Total number of QA pair answered 197
{'exact_match': 53.807106598984774, 'f1': 69.58675179039467}


In [None]:
pred_list_distillbert_4khz_just4 = test_pipeline_tran(df_4khz, distillbert_pipeline, [4])
res = squad_eval(pred_list_distillbert_4khz_just4, ref_list_just4)

100%|██████████| 1/1 [00:24<00:00, 24.87s/it]


Total number of QA pair answered 197
{'exact_match': 45.17766497461929, 'f1': 62.1043498156069}


44.1 kHz track

In [None]:
df_44khz = pd.read_csv("/content/dev_1_44.1khz_2_transcripts_wer.csv")

In [None]:
pred_list_roberta_44khz = test_pipeline_tran(df_44khz, qa_roberta, range(1, 6))

100%|██████████| 5/5 [04:16<00:00, 51.37s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_roberta_44khz, ref_list_1to5)

{'exact_match': 72.88389513108615, 'f1': 83.6576218294353}


In [None]:
pred_list_distillbert_44khz = test_pipeline_tran(df_44khz, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:08<00:00, 25.78s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_distillbert_44khz, ref_list_1to5)

{'exact_match': 63.89513108614232, 'f1': 76.33880192152579}


44.1 kHz, SNR = 2 track

In [None]:
df_44khz_snr2 = pd.read_csv("/content/dev_44.1khz_classroom_snr_2_transcripts_wer.csv")

In [None]:
pred_list_roberta_44khz_snr2 = test_pipeline_tran(df_44khz_snr2, qa_roberta, range(1, 6))

100%|██████████| 5/5 [04:15<00:00, 51.18s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_roberta_44khz_snr2, ref_list_1to5)

{'exact_match': 66.06741573033707, 'f1': 78.86457662921637}


In [None]:
pred_list_distillbert_44khz_snr2 = test_pipeline_tran(df_44khz_snr2, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:07<00:00, 25.50s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_distillbert_44khz_snr2, ref_list_1to5)

{'exact_match': 58.12734082397004, 'f1': 72.57495400499408}


44.1 kHz, SNR = 5 track

In [None]:
df_44khz_snr5 = pd.read_csv("/content/dev_44.1khz_classroom_snr_5_transcripts_wer.csv")

In [None]:
pred_list_roberta_44khz_snr5 = test_pipeline_tran(df_44khz_snr5, qa_roberta, range(1, 6))

100%|██████████| 5/5 [04:15<00:00, 51.15s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_roberta_44khz_snr5, ref_list_1to5)

{'exact_match': 67.79026217228464, 'f1': 79.7800243051419}


In [None]:
pred_list_distillbert_44khz_snr5 = test_pipeline_tran(df_44khz_snr5, distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:07<00:00, 25.47s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_distillbert_44khz_snr5, ref_list_1to5)

{'exact_match': 60.0749063670412, 'f1': 73.91309731739555}


clean (pure transcript) track

In [None]:
pred_list_clean_roberta = test_pipeline(qa_roberta, range(1, 6))

100%|██████████| 5/5 [04:23<00:00, 52.78s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_clean_roberta, ref_list_1to5)

{'exact_match': 85.31835205992509, 'f1': 91.86646198813034}


In [None]:
pred_list_clean_distillbert = test_pipeline(distillbert_pipeline, range(1, 6))

100%|██████████| 5/5 [02:10<00:00, 26.14s/it]

Total number of QA pair answered 1335





In [None]:
res = squad_eval(pred_list_clean_distillbert, ref_list_1to5)

{'exact_match': 75.65543071161049, 'f1': 84.90251207594609}


In [None]:
ref_list_just4 = build_ref_list([4])

100%|██████████| 1/1 [00:00<00:00, 361.92it/s]

Total number of QA pair answered 197





In [None]:
pred_list_clean_roberta_just4 = test_pipeline(qa_roberta, [4])

100%|██████████| 1/1 [00:49<00:00, 49.70s/it]

Total number of QA pair answered 197





In [None]:
res = squad_eval(pred_list_clean_roberta_just4, ref_list_just4)

{'exact_match': 83.248730964467, 'f1': 90.75669210120185}
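The scores for articles 1-5 are spread across many cells, so a side-by-side view makes the tracks easier to compare. The sketch below copies the `exact_match` and `f1` values printed above (rounded to two decimals) into a dictionary and reports how many F1 points each track loses relative to the clean transcript. The dictionary keys and the `f1_drop` helper are labels chosen for this illustration, not names from the notebook.

```python
# (exact_match, f1) for articles 1-5, copied from the squad_eval outputs
# above and rounded to two decimals.
scores = {
    "roberta": {
        "clean": (85.32, 91.87),
        "44.1khz": (72.88, 83.66),
        "44.1khz_snr5": (67.79, 79.78),
        "44.1khz_snr2": (66.07, 78.86),
        "4khz": (53.03, 67.25),
        "2khz": (57.45, 71.07),
    },
    "distilbert": {
        "clean": (75.66, 84.90),
        "44.1khz": (63.90, 76.34),
        "44.1khz_snr5": (60.07, 73.91),
        "44.1khz_snr2": (58.13, 72.57),
        "4khz": (47.42, 62.04),
        "2khz": (50.71, 65.73),
    },
}


def f1_drop(model, track):
    """F1 points lost on `track` relative to the clean-transcript run."""
    return round(scores[model]["clean"][1] - scores[model][track][1], 2)


for model, tracks in scores.items():
    for track, (em, f1) in tracks.items():
        drop = f1_drop(model, track)
        print(f"{model:<11}{track:<14}EM={em:6.2f}  F1={f1:6.2f}  F1 drop={drop:5.2f}")
```

In these runs the 2 kHz track scores higher than the 4 kHz track for both models, which may be worth re-checking against the transcript files before drawing conclusions.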


# Previous Block

In [None]:
pred_list_5_roberta = test_pipeline(qa_roberta, range(5))

100%|██████████| 5/5 [06:13<00:00, 74.72s/it]

Total number of QA pair answered 1877





In [None]:
pred_list_5_distillbert = test_pipeline(distillbert_pipeline, range(5))

100%|██████████| 5/5 [03:08<00:00, 37.61s/it]

Total number of QA pair answered 1877





In [None]:
pred_list_5_roberta[:5]

[{'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ec'},
 {'prediction_text': 'Carolina Panthers', 'id': '56be4db0acb8001400a502ed'},
 {'prediction_text': "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California",
  'id': '56be4db0acb8001400a502ee'},
 {'prediction_text': 'Denver Broncos', 'id': '56be4db0acb8001400a502ef'},
 {'prediction_text': 'gold', 'id': '56be4db0acb8001400a502f0'}]

In [None]:
print(dev_json["data"][0]["paragraphs"][0]["qas"][0])

{'answers': [{'answer_start': 177, 'text': 'Denver Broncos'}, {'answer_start': 177, 'text': 'Denver Broncos'}, {'answer_start': 177, 'text': 'Denver Broncos'}], 'question': 'Which NFL team represented the AFC at Super Bowl 50?', 'id': '56be4db0acb8001400a502ec'}


In [None]:
ref_list_first5

In [None]:
!pip install datasets

In [None]:
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = pred_list_5_roberta
references = ref_list_first5
results = squad_metric.compute(predictions=predictions, references=references)
print(results)

{'exact_match': 87.74640383590837, 'f1': 92.74996739314979}


In [None]:
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = pred_list_5_distillbert
references = ref_list_first5
results = squad_metric.compute(predictions=predictions, references=references)
print(results)

{'exact_match': 78.58284496537027, 'f1': 85.80494192372488}


In [None]:
from datasets import load_metric

squad_metric = load_metric("squad")
predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
# Output: {'exact_match': 0.0, 'f1': 0.0}

{'exact_match': 0.0, 'f1': 0.0}


In [None]:
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 0.0, 'f1': 0.0}
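The toy call above returns zero for both scores because `'1999'` shares no token with `'1976'`. To make the two numbers easier to interpret, here is a minimal per-question sketch of SQuAD-style exact match and token-level F1, following the normalization used by the SQuAD v1.1 evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace). It is an illustration rather than the library's implementation: the function names are chosen for this example, and the scores are on a 0 to 1 scale, whereas the metric reports percentages averaged over all questions.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """1.0 if the normalized prediction equals any normalized reference answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(t) for t in truths))


def f1_score(prediction, truths):
    """Best token-overlap F1 between the prediction and any reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for truth in truths:
        truth_tokens = normalize_answer(truth).split()
        overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(truth_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("1999", ["1976"]), f1_score("1999", ["1976"]))
print(exact_match("the Denver Broncos", ["Denver Broncos"]))
print(f1_score("Levi's Stadium in Santa Clara", ["Levi's Stadium"]))
```

A partially correct span such as the last example gets no credit from exact match but a nonzero F1, which is why F1 sits above exact match in every run in this notebook.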